Cross-modal Background Suppression for Audio-Visual Event Localization

Cited by: 15
Authors
Xia, Yan [1]
Zhao, Zhou [1]
Affiliations
[1] Zhejiang Univ, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China; Natural Science Foundation of Zhejiang Province
Keywords
NETWORK
DOI
10.1109/CVPR52688.2022.01936
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Audio-Visual Event (AVE) localization requires a model to jointly localize an event by observing audio and visual information. However, in unconstrained videos, the two modalities may be inconsistent or suffer from severe background noise. This paper therefore proposes a novel cross-modal background suppression network for the AVE task that operates at both the time level and the event level, aiming to improve localization performance by suppressing asynchronous audio-visual background frames and reducing redundant noise. Specifically, the time-level background suppression scheme forces each modality to focus, along the temporal dimension, on the information the opposite modality considers essential, and to reduce attention to segments the other modality regards as background. The event-level background suppression scheme uses the class activation sequences predicted by the audio and visual modalities to control the final event-category prediction, which effectively suppresses noise events that occur accidentally in a single modality. Furthermore, we introduce a cross-modal gated attention scheme that exploits both global visual and audio signals to extract relevant visual regions from complex scenes. Extensive experiments show that our method outperforms state-of-the-art methods by a large margin in both the supervised and the weakly supervised AVE settings.
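The two suppression schemes described in the abstract can be illustrated with a minimal numpy sketch. This is an assumption-laden simplification, not the paper's actual network: the scorers, gating functions, and fusion rule here (`w_a`, `w_v`, sigmoid gating, averaged class activation sequences) are hypothetical stand-ins chosen only to show the cross-modal re-weighting idea.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def time_level_suppression(audio_feats, visual_feats, w_a, w_v):
    """Time-level scheme (sketch): each modality's per-segment features
    are re-weighted by the OTHER modality's importance scores, so
    segments one stream treats as background are damped in the other.
    audio_feats, visual_feats: (T, D); w_a, w_v: (D,) toy linear scorers.
    """
    s_a = softmax(audio_feats @ w_a)    # (T,) audio's view of each segment
    s_v = softmax(visual_feats @ w_v)   # (T,) visual's view of each segment
    audio_out = audio_feats * s_v[:, None]    # audio gated by visual scores
    visual_out = visual_feats * s_a[:, None]  # visual gated by audio scores
    return audio_out, visual_out

def event_level_suppression(cas_a, cas_v):
    """Event-level scheme (sketch): fuse the two class activation
    sequences (T, C) through a multiplicative gate, so an event that
    fires in only one modality is suppressed in the final prediction."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    gate = sigmoid(cas_a) * sigmoid(cas_v)  # small unless BOTH agree
    return gate * (cas_a + cas_v) / 2.0
```

The multiplicative gate is the key design point: an accidental high activation in a single modality is multiplied by the other modality's near-zero confidence and thus cannot dominate the event-category prediction.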
Pages: 19957-19966
Page count: 10