Videos capture auditory and visual signals, each conveying distinct events. Analyzing these multimodal signals jointly enhances human comprehension of video content. We focus on the audio-visual video parsing task, which integrates auditory and visual cues to identify the events in each modality and to pinpoint their temporal boundaries. Since fine-grained segment-level annotation is labor-intensive and time-consuming, only video-level labels are provided during training; the labels and timestamps for each modality are unknown. A prevalent strategy is to aggregate audio and visual features through cross-modal attention and then denoise the video labels to parse events within video segments in a weakly supervised manner. However, these denoised labels are still restricted to the video level, so segment-level annotations remain unavailable. In this paper, we propose a semantic dictionary description method for audio-visual video parsing, termed SDDP (Semantic Dictionary Description for video Parsing), which uses a semantic dictionary to explicitly represent the content of video segments. In particular, we query the relevance of each segment to the semantic words in the dictionary and select the pertinent words to redescribe that segment. The redescribed segments encode event-related information, facilitating cross-modal video parsing. Furthermore, a pseudo label generation strategy is introduced to convert the relevance scores of the dictionary queries into segment-level pseudo labels, which supply segment-level event information to supervise event prediction. Extensive experiments demonstrate the effectiveness of the proposed method, which achieves superior performance over state-of-the-art methods.
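The sketch below is only meant to make the abstract's two key operations concrete: querying segment-to-word relevance against a semantic dictionary, redescribing each segment from the relevant words, and thresholding the relevance scores into segment-level pseudo labels. It is a minimal illustration under assumed shapes, module names, and a hypothetical thresholding rule, not the authors' implementation.

```python
# Minimal sketch of the semantic-dictionary idea (illustrative assumptions,
# not the released SDDP code).
import torch
import torch.nn as nn


class SemanticDictionaryRedescription(nn.Module):
    def __init__(self, num_words: int, dim: int):
        super().__init__()
        # Learnable dictionary of semantic word embeddings (assumed learnable).
        self.dictionary = nn.Parameter(torch.randn(num_words, dim))

    def forward(self, segments: torch.Tensor, tau: float = 0.5):
        # segments: (batch, num_segments, dim) audio or visual segment features.
        # Relevance of each segment to each semantic word (scaled dot product).
        relevance = segments @ self.dictionary.t() / segments.size(-1) ** 0.5
        weights = relevance.softmax(dim=-1)
        # Redescribe each segment as a relevance-weighted sum of dictionary words.
        redescribed = weights @ self.dictionary
        # Segment-level pseudo labels: keep words whose relevance exceeds a
        # fraction of the uniform weight (hypothetical rule for illustration).
        pseudo_labels = (weights > tau / weights.size(-1)).float()
        return redescribed, pseudo_labels


# Usage: redescribe 10 segments per video with a 25-word dictionary.
model = SemanticDictionaryRedescription(num_words=25, dim=512)
segments = torch.randn(2, 10, 512)
redescribed, pseudo_labels = model(segments)
print(redescribed.shape, pseudo_labels.shape)  # (2, 10, 512) (2, 10, 25)
```

In this reading, the redescribed features would feed the cross-modal parsing branch, while the pseudo labels would serve as the segment-level supervision signal the abstract describes; both the dictionary size and the threshold are placeholders.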