Cross-modal Background Suppression for Audio-Visual Event Localization

Cited by: 15
Authors:
Xia, Yan [1]; Zhao, Zhou [1]
Institution:
[1] Zhejiang Univ, Hangzhou, Peoples R China
Funding:
Zhejiang Provincial Natural Science Foundation; National Natural Science Foundation of China
Keywords:
NETWORK
DOI:
10.1109/CVPR52688.2022.01936
CLC Classification:
TP18 [Artificial Intelligence Theory]
Discipline Codes:
081104; 0812; 0835; 1405
Abstract
Audio-Visual Event (AVE) localization requires the model to jointly localize an event by observing audio and visual information. However, in unconstrained videos, both information types may be inconsistent or suffer from severe background noise. Hence this paper proposes a novel cross-modal background suppression network for the AVE task, operating at the time level and the event level, aiming to improve localization performance by suppressing asynchronous audio-visual background frames in the examined events and reducing redundant noise. Specifically, the time-level background suppression scheme forces the audio and visual modalities to focus, along the temporal dimension, on the information that the opposite modality considers essential, and reduces attention to the segments that the other modality considers background. The event-level background suppression scheme uses the class activation sequences predicted by the audio and visual modalities to control the final event category prediction, which can effectively suppress noise events occurring accidentally in a single modality. Furthermore, we introduce a cross-modal gated attention scheme to extract relevant visual regions from complex scenes, exploiting both global visual and audio signals. Extensive experiments show our method outperforms the state-of-the-art methods by a large margin in both supervised and weakly supervised AVE settings.
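The two suppression schemes in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the linear foreground scorers (`w_a`, `w_v`), the sigmoid gating, and all shapes are assumptions chosen only to show the idea that each modality's segments are re-weighted by the *other* modality's foreground confidence, and that the final class activation is gated by the product of both modalities' class activation sequences.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def time_level_suppression(audio_feats, visual_feats, w_a, w_v):
    """Time-level scheme: suppress segments the opposite modality calls background.

    audio_feats, visual_feats: (T, D) per-segment features.
    w_a, w_v: (D,) hypothetical linear foreground scorers per modality.
    """
    conf_a = sigmoid(audio_feats @ w_a)    # (T,) audio foreground confidence
    conf_v = sigmoid(visual_feats @ w_v)   # (T,) visual foreground confidence
    # Each modality is weighted by the OTHER modality's confidence, so a
    # segment that one modality treats as background is attenuated in both.
    audio_out = audio_feats * conf_v[:, None]
    visual_out = visual_feats * conf_a[:, None]
    return audio_out, visual_out

def event_level_suppression(cas_a, cas_v):
    """Event-level scheme: keep only events both modalities agree on.

    cas_a, cas_v: (T, C) class activation sequences in [0, 1]; the
    elementwise product damps events fired accidentally by one modality.
    """
    return cas_a * cas_v
```

With this gating, an event class scores high at a time step only when both the audio and the visual class activation sequences support it, which is one simple way to realize the cross-modal agreement the abstract describes.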
Pages: 19957 - 19966 (10 pages)