Segment-level event perception with semantic dictionary for weakly supervised audio-visual video parsing

Cited: 0
Authors
Xie, Zhuyang [1 ,2 ]
Yang, Yan [1 ,2 ,3 ]
Yu, Yankai [1 ,2 ]
Wang, Jie [1 ,2 ]
Liu, Yan [1 ,2 ]
Jiang, Yongquan [1 ,2 ,3 ]
Affiliations
[1] Southwest Jiaotong Univ, Sch Comp & Artificial Intelligence, Chengdu 611756, Peoples R China
[2] Minist Educ, Engn Res Ctr Sustainable Urban Intelligent Transpo, Beijing, Peoples R China
[3] Natl Engn Lab Integrated Transportat Big Data Appl, Chengdu 611756, Peoples R China
Keywords
Audio-visual video parsing; Weakly supervised; Multimodal learning; Pseudo label; Action recognition
DOI
10.1016/j.knosys.2024.112884
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Videos capture auditory and visual signals, each conveying distinct events. Simultaneously analyzing these multimodal signals enhances human comprehension of the video content. We focus on the audio-visual video parsing task, in which we integrate auditory and visual cues to identify events in each modality and pinpoint their temporal boundaries. Since fine-grained segment-level annotation is labor-intensive and time-consuming, only video-level labels are provided during the training phase. Labels and timestamps for each modality are unknown. A prevalent strategy is to aggregate audio and visual features through cross-modal attention and further denoise video labels to parse events within video segments in a weakly supervised manner. However, these denoised labels have limitations: they are restricted to the video level, and segment-level annotations remain unknown. In this paper, we propose a semantic dictionary description method for audio-visual video parsing, termed SDDP (Semantic Dictionary Description for video Parsing), which uses a semantic dictionary to explicitly represent the content of video segments. In particular, we query the relevance of each segment with semantic words from the dictionary and determine the pertinent semantic words to redescribe each segment. These redescribed segments encode event-related information, facilitating cross-modal video parsing. Furthermore, a pseudo label generation strategy is introduced to convert the relevance of semantic dictionary queries into segment-level pseudo labels, which provide segment-level event information to supervise event prediction. Extensive experiments demonstrate the effectiveness of the proposed method, achieving superior performance compared with state-of-the-art methods.
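The abstract describes two core ideas: querying each audio or visual segment against a dictionary of semantic words to obtain relevance scores and a redescribed segment representation, and converting those relevance scores into segment-level pseudo labels gated by the known video-level labels. The PyTorch sketch below only illustrates that pipeline under assumed shapes and names (SemanticDictionaryQuery, segment_pseudo_labels, the 0.5 threshold, and the feature dimensions are hypothetical choices, not the authors' implementation).

```python
import torch
import torch.nn as nn


class SemanticDictionaryQuery(nn.Module):
    """Sketch of a dictionary query: relate segment features to learnable
    semantic word embeddings and redescribe each segment as a weighted
    mixture of those words (sizes and names are illustrative assumptions)."""

    def __init__(self, feat_dim: int = 512, num_words: int = 25):
        super().__init__()
        # One learnable embedding per semantic word (e.g., per event class).
        self.dictionary = nn.Parameter(torch.randn(num_words, feat_dim) * 0.02)

    def forward(self, segments: torch.Tensor):
        # segments: (batch, num_segments, feat_dim) audio or visual features.
        scale = segments.size(-1) ** 0.5
        relevance = segments @ self.dictionary.t() / scale   # (B, T, W) scores
        weights = relevance.softmax(dim=-1)                  # query weights
        redescribed = weights @ self.dictionary              # (B, T, D) mixture
        return redescribed, relevance


def segment_pseudo_labels(relevance: torch.Tensor,
                          video_labels: torch.Tensor,
                          threshold: float = 0.5) -> torch.Tensor:
    """Convert query relevance into segment-level pseudo labels, gated by the
    video-level multi-hot labels (the threshold value is an assumption)."""
    probs = relevance.sigmoid()                    # (B, T, W) per-word scores
    pseudo = (probs > threshold).float()
    # A segment may only carry events that appear in the video-level label.
    return pseudo * video_labels.unsqueeze(1)      # broadcast over segments


if __name__ == "__main__":
    # Toy usage: 2 videos, 10 segments each, 25 semantic words.
    query = SemanticDictionaryQuery()
    feats = torch.randn(2, 10, 512)
    video_labels = torch.randint(0, 2, (2, 25)).float()
    redescribed, relevance = query(feats)
    pseudo = segment_pseudo_labels(relevance, video_labels)
```

Gating the thresholded scores with the video-level multi-hot label reflects the weak-supervision constraint stated in the abstract: a segment can only be assigned events that are known to occur somewhere in the video.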
Pages: 14
Related Papers
50 records in total
  • [41] Audio-Visual Event Localization by Learning Spatial and Semantic Co-Attention. Xue, Cheng; Zhong, Xionghu; Cai, Minjie; Chen, Hao; Wang, Wenwu. IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25: 418-429.
  • [42] Extracting semantic information from basketball video based on audio-visual features. Kim, K; Choi, J; Kim, N; Kim, P. IMAGE AND VIDEO RETRIEVAL, 2002, 2383: 278-288.
  • [43] Enhancing semantic audio-visual representation learning with supervised multi-scale attention. Zhang, Jiwei; Yu, Yi; Tang, Suhua; Qi, Guojun; Wu, Haiyuan; Hachiya, Hirotaka. PATTERN ANALYSIS AND APPLICATIONS, 2025, 28 (02).
  • [44] Segment beyond View: Handling Partially Missing Modality for Audio-Visual Semantic Segmentation. Wu, Renjie; Wang, Hu; Dayoub, Feras; Chen, Hsiang-Ting. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 6, 2024: 6100-6108.
  • [45] CSS-Net: A Consistent Segment Selection Network for Audio-Visual Event Localization. Feng, Fan; Ming, Yue; Hu, Nannan; Yu, Hui; Liu, Yuanan. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26: 701-713.
  • [46] Detection of music segment boundaries using audio-visual features for a personal video recorder. Otsuka, Isao; Suginohara, Hidetsugu; Kusunoki, Yoshiaki; Divakaran, Ajay. IEEE TRANSACTIONS ON CONSUMER ELECTRONICS, 2007, 53 (01): 150-154.
  • [47] Detection of music segment boundaries using audio-visual features for a personal video recorder. Otsuka, Isao; Suginohara, Hidetsugu; Kusunoki, Yoshiaki; Divakaran, Ajay. ICCE: 2007 DIGEST OF TECHNICAL PAPERS INTERNATIONAL CONFERENCE ON CONSUMER ELECTRONICS, 2007: 155-+.
  • [48] The effect of combined sensory and semantic components on audio-visual speech perception in older adults. Maguinness, Corrina; Setti, Annalisa; Burke, Kate E.; Kenny, Rose Anne; Newell, Fiona N. FRONTIERS IN AGING NEUROSCIENCE, 2011, 3: 1-9.
  • [49] Event-related potentials associated with somatosensory effect in audio-visual speech perception. Ito, Takayuki; Ohashi, Hiroki; Montas, Eva; Gracco, Vincent L. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017: 669-673.
  • [50] Semantic Analysis of Field Sports Video using a Petri-Net of Audio-Visual Concepts. Bai, Liang; Lao, Songyang; Smeaton, Alan F.; O'Connor, Noel E.; Sadlier, David; Sinclair, David. COMPUTER JOURNAL, 2009, 52 (07): 808-823.