Cross-Modal Attention Network for Temporal Inconsistent Audio-Visual Event Localization

Cited: 0
Authors
Xuan, Hanyu [1 ]
Zhang, Zhenyu [1 ]
Chen, Shuo [1 ]
Yang, Jian [1 ,2 ]
Yan, Yan [1 ]
Affiliations
[1] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Minist Educ, PCA Lab, Key Lab Intelligent Percept & Syst & High, Nanjing, Peoples R China
[2] Jiangsu Key Lab Image & Video Understanding Socia, Nanjing, Peoples R China
Keywords
DOI
None
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In human multi-modality perception systems, the benefits of integrating auditory and visual information are extensive, as they provide plentiful supplementary cues for understanding events. Although some recent methods have been proposed for this task, they cannot handle practical conditions involving temporal inconsistency. Inspired by the human system, which places different focuses on specific locations, time segments, and media while performing multi-modality perception, we propose an attention-based method to simulate this process. Similar to the human mechanism, our network can adaptively select "where" to attend, "when" to attend, and "which" to attend for audio-visual event localization. In this way, even with large temporal inconsistency between vision and audio, our network is able to adaptively trade information between the modalities and successfully localize events. Our method achieves state-of-the-art performance on the AVE (Audio-Visual Event) dataset, which was collected in real-life conditions. In addition, we systematically investigate audio-visual event localization tasks. The visualization results also help us better understand how our model works.
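The record does not specify the network architecture behind the "where/when/which" attention. As a rough illustration of the general idea, a minimal sketch of audio-guided cross-modal attention over time segments (all function names, shapes, and the dot-product scoring are assumptions, not the authors' actual design) might look like:

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross_modal_attention(visual, audio):
    """Audio-guided attention over visual time segments.

    visual, audio: lists of T feature vectors (each a length-D list).
    Each audio segment scores all visual segments (deciding "when"
    to attend), and the scores are normalized into attention weights
    used to pool the visual features.
    Returns (attended, weights); weights is a T x T row-stochastic matrix.
    """
    d = len(visual[0])
    attended, weights = [], []
    for a in audio:
        # Scaled dot-product scores between this audio segment
        # and every visual segment.
        scores = [dot(a, v) / math.sqrt(d) for v in visual]
        w = softmax(scores)
        # Attention-weighted sum of the visual features.
        ctx = [sum(wi * v[k] for wi, v in zip(w, visual)) for k in range(d)]
        attended.append(ctx)
        weights.append(w)
    return attended, weights

random.seed(0)
T, D = 4, 8  # 4 time segments, 8-dim features (toy sizes)
vis = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
aud = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
attended, w = cross_modal_attention(vis, aud)
```

Because the weights are recomputed per segment, a misaligned audio cue can still pull in visual evidence from a different time step, which is the intuition behind handling temporal inconsistency.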
Pages: 279-286
Page count: 8
Related Papers (50 total)
  • [1] Temporal Cross-Modal Attention for Audio-Visual Event Localization
    Nagasaki, Yoshiki
    Hayashi, Masaki
    Kaneko, Naoshi
    Aoki, Yoshimitsu
    [J]. Seimitsu Kogaku Kaishi/Journal of the Japan Society for Precision Engineering, 2022, 88 (03): : 263 - 268
  • [2] Discriminative Cross-Modality Attention Network for Temporal Inconsistent Audio-Visual Event Localization
    Xuan, Hanyu
    Luo, Lei
    Zhang, Zhenyu
    Yang, Jian
    Yan, Yan
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30 : 7878 - 7888
  • [3] Cross-modal Background Suppression for Audio-Visual Event Localization
    Xia, Yan
    Zhao, Zhou
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 19957 - 19966
  • [4] Audio-Visual Event Localization based on Cross-Modal Interacting Guidance
    Yue, Qiurui
    Wu, Xiaoyu
    Gao, Jiayi
    [J]. 2021 IEEE FOURTH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND KNOWLEDGE ENGINEERING (AIKE 2021), 2021, : 104 - 107
  • [5] Cross-Modal Relation-Aware Networks for Audio-Visual Event Localization
    Xu, Haoming
    Zeng, Runhao
    Wu, Qingyao
    Tan, Mingkui
    Gan, Chuang
    [J]. MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 3893 - 3901
  • [6] Cross-Modal Label Contrastive Learning for Unsupervised Audio-Visual Event Localization
    Bao, Peijun
    Yang, Wenhan
    Ng, Boon Poh
    Er, Meng Hwa
    Kot, Alex C.
    [J]. THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 1, 2023, : 215 - 222
  • [7] Temporal and Cross-modal Attention for Audio-Visual Zero-Shot Learning
    Mercea, Otniel-Bogdan
    Hummel, Thomas
    Koepke, A. Sophia
    Akata, Zeynep
    [J]. COMPUTER VISION, ECCV 2022, PT XX, 2022, 13680 : 488 - 505
  • [8] Audio-visual Speaker Recognition with a Cross-modal Discriminative Network
    Tao, Ruijie
    Das, Rohan Kumar
    Li, Haizhou
    [J]. INTERSPEECH 2020, 2020, : 2242 - 2246
  • [9] Deep Cross-Modal Audio-Visual Generation
    Chen, Lele
    Srivastava, Sudhanshu
    Duan, Zhiyao
    Xu, Chenliang
    [J]. PROCEEDINGS OF THE THEMATIC WORKSHOPS OF ACM MULTIMEDIA 2017 (THEMATIC WORKSHOPS'17), 2017, : 349 - 357
  • [10] Cross-modal prediction in audio-visual communication
    Rao, RR
    Chen, TH
    [J]. 1996 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, CONFERENCE PROCEEDINGS, VOLS 1-6, 1996, : 2056 - 2059