Segment-level event perception with semantic dictionary for weakly supervised audio-visual video parsing

Cited by: 0
Authors
Xie, Zhuyang [1, 2]
Yang, Yan [1, 2, 3]
Yu, Yankai [1, 2]
Wang, Jie [1, 2]
Liu, Yan [1, 2]
Jiang, Yongquan [1, 2, 3]
Affiliations
[1] Southwest Jiaotong Univ, Sch Comp & Artificial Intelligence, Chengdu 611756, Peoples R China
[2] Minist Educ, Engn Res Ctr Sustainable Urban Intelligent Transpo, Beijing, Peoples R China
[3] Natl Engn Lab Integrated Transportat Big Data Appl, Chengdu 611756, Peoples R China
Keywords
Audio-visual video parsing; Weakly supervised; Multimodal learning; Pseudo label; Action recognition
DOI
10.1016/j.knosys.2024.112884
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Videos capture auditory and visual signals, each conveying distinct events. Simultaneously analyzing these multimodal signals enhances human comprehension of the video content. We focus on the audio-visual video parsing task, in which we integrate auditory and visual cues to identify events in each modality and pinpoint their temporal boundaries. Since fine-grained segment-level annotation is labor-intensive and time-consuming, only video-level labels are provided during the training phase. Labels and timestamps for each modality are unknown. A prevalent strategy is to aggregate audio and visual features through cross-modal attention and further denoise video labels to parse events within video segments in a weakly supervised manner. However, these denoised labels have limitations: they are restricted to the video level, and segment-level annotations remain unknown. In this paper, we propose a semantic dictionary description method for audio-visual video parsing, termed SDDP (Semantic Dictionary Description for video Parsing), which uses a semantic dictionary to explicitly represent the content of video segments. In particular, we query the relevance of each segment with semantic words from the dictionary and determine the pertinent semantic words to redescribe each segment. These redescribed segments encode event-related information, facilitating cross-modal video parsing. Furthermore, a pseudo label generation strategy is introduced to convert the relevance of semantic dictionary queries into segment-level pseudo labels, which provide segment-level event information to supervise event prediction. Extensive experiments demonstrate the effectiveness of the proposed method, achieving superior performance compared with state-of-the-art methods.
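The abstract describes two steps: querying each segment against a semantic dictionary to redescribe it, and converting the query relevance into segment-level pseudo labels. The sketch below illustrates that idea only and is not the authors' SDDP implementation. Cosine similarity as the relevance score, top-k word selection, softmax weighting, the fixed threshold, and all function and parameter names are assumptions introduced for illustration; the record does not specify any of them.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf receive zero weight.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def redescribe_segments(segments, dictionary, top_k=2):
    """Query a dictionary of semantic word embeddings with segment features.

    segments:   (T, D) array, one feature vector per video segment.
    dictionary: (C, D) array, one embedding per semantic word (hypothetical).
    Returns the (T, C) relevance matrix and the (T, D) redescribed segments.
    """
    seg = segments / np.linalg.norm(segments, axis=1, keepdims=True)
    dic = dictionary / np.linalg.norm(dictionary, axis=1, keepdims=True)
    relevance = seg @ dic.T  # cosine similarity (assumed relevance measure)

    # Keep only the top_k most relevant words for each segment.
    top_idx = np.argsort(-relevance, axis=1)[:, :top_k]
    mask = np.zeros_like(relevance, dtype=bool)
    np.put_along_axis(mask, top_idx, True, axis=1)
    weights = softmax(np.where(mask, relevance, -np.inf), axis=1)

    # Redescribe each segment as a weighted mix of its selected words.
    redescribed = weights @ dictionary
    return relevance, redescribed

def segment_pseudo_labels(relevance, video_labels, threshold=0.5):
    """Turn relevance scores into binary segment-level pseudo labels.

    An event is marked in a segment only if its relevance exceeds the
    threshold AND the event appears in the video-level label.
    """
    active = relevance > threshold
    allowed = video_labels[None, :] == 1
    return (active & allowed).astype(np.int64)

# Toy example: 4 segments, 3 semantic words, video-level label has events 0 and 1.
dictionary = np.eye(3)
segments = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.9, 0.1, 0.0],
                     [0.0, 0.0, 1.0]])
video_labels = np.array([1, 1, 0])

relevance, redescribed = redescribe_segments(segments, dictionary, top_k=2)
pseudo = segment_pseudo_labels(relevance, video_labels, threshold=0.5)
print(pseudo)
```

In this toy setup the last segment matches only the third word, which is absent from the video-level label, so it receives no pseudo label. Gating by the video-level label is one plausible way to keep segment-level supervision consistent with the weak annotation.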
Pages: 14
Related papers
50 in total
  • [21] Weakly Supervised Temporal Action Localization with Segment-Level Labels
    Ding, Xinpeng
    Wang, Nannan
    Li, Jie
    Gao, Xinbo
    PATTERN RECOGNITION AND COMPUTER VISION, PT I, 2021, 13019 : 42 - 54
  • [22] Collecting Cross-Modal Presence-Absence Evidence for Weakly-Supervised Audio-Visual Event Perception
    Gao, Junyu
    Chen, Mengyuan
    Xu, Changsheng
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 18827 - 18836
  • [23] Distributed Semantic Communications for Multimodal Audio-Visual Parsing Tasks
    Wang, Penghong
    Li, Jiahui
    Liu, Chen
    Fan, Xiaopeng
    Ma, Mengyao
    Wang, Yaowei
    IEEE TRANSACTIONS ON GREEN COMMUNICATIONS AND NETWORKING, 2024, 8 (04): : 1707 - 1716
  • [24] Toward a perceptive pretraining framework for Audio-Visual Video Parsing
    Wu, Jianning
    Jiang, Zhuqing
    Chen, Qingchao
    Wen, Shiping
    Men, Aidong
    Wang, Haiying
    INFORMATION SCIENCES, 2022, 609 : 897 - 912
  • [25] Applying Segment-Level Attention on Bi-Modal Transformer Encoder for Audio-Visual Emotion Recognition
    Hsu, Jia-Hao
    Wu, Chung-Hsien
    IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2023, 14 (04) : 3231 - 3243
  • [26] Cross-Modal learning for Audio-Visual Video Parsing
    Lamba, Jatin
    Abhishek
    Akula, Jayaprakash
    Dabral, Rishabh
    Jyothi, Preethi
    Ramakrishnan, Ganesh
    INTERSPEECH 2021, 2021, : 1937 - 1941
  • [27] Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser
    Lai, Yung-Hsuan
    Chen, Yen-Chun
    Wang, Yu-Chiang Frank
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [28] Audio-Visual Weakly Supervised Approach for Apathy Detection in the Elderly
    Sharma, Garima
    Joshi, Jyoti
    Zeghari, Radia
    Guerchouche, Rachid
    2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020,
  • [29] Weakly Supervised Representation Learning for Audio-Visual Scene Analysis
    Parekh, Sanjeel
    Essid, Slim
    Ozerov, Alexey
    Ngoc Q K Duong
    Perez, Patrick
    Richard, Gael
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2020, 28 (28) : 416 - 428
  • [30] Semantic and Relation Modulation for Audio-Visual Event Localization
    Wang, Hao
    Zha, Zheng-Jun
    Li, Liang
    Chen, Xuejin
    Luo, Jiebo
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (06) : 7711 - 7725