Cross-Modal Attention Network for Temporal Inconsistent Audio-Visual Event Localization

Cited by: 0
Authors
Xuan, Hanyu [1 ]
Zhang, Zhenyu [1 ]
Chen, Shuo [1 ]
Yang, Jian [1 ,2 ]
Yan, Yan [1 ]
Affiliations
[1] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Minist Educ, PCA Lab,Key Lab Intelligent Percept & Syst & Iigh, Nanjing, Peoples R China
[2] Jiangsu Key Lab Image & Video Understanding Socia, Nanjing, Peoples R China
Keywords
DOI: not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In human multi-modal perception, integrating auditory and visual information is highly beneficial, as the two modalities provide abundant complementary cues for understanding events. Although several methods have recently been proposed for this task, they cannot handle practical conditions in which audio and vision are temporally inconsistent. Inspired by the human perceptual system, which shifts its focus across spatial locations, time segments, and modalities, we propose an attention-based method that simulates this process. Like its human counterpart, our network adaptively selects "where", "when", and "which" modality to attend to for audio-visual event localization. As a result, even under large temporal inconsistency between vision and audio, the network can adaptively trade information between the modalities and successfully localize events. Our method achieves state-of-the-art performance on the real-world AVE (Audio-Visual Event) dataset. In addition, we systematically investigate audio-visual event localization tasks, and visualization results help explain how the model works.
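The "where to attend" step described above can be illustrated with a minimal sketch: for one time segment, per-region visual features are weighted by their affinity to the segment's audio feature, so audio guides spatial visual attention. This is a generic illustration of cross-modal attention, not the paper's actual architecture; the function name and scaled-dot-product scoring are assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def audio_guided_visual_attention(visual, audio):
    """Attend over spatial regions of a frame using the audio cue.

    visual: (regions, d) array of per-region visual features ("where").
    audio:  (d,) audio feature for the same time segment.
    Returns the attended visual feature vector of shape (d,).
    """
    d = visual.shape[1]
    scores = visual @ audio / np.sqrt(d)   # affinity of each region to the audio
    weights = softmax(scores)              # attention distribution over regions
    return weights @ visual                # attention-weighted visual summary
```

The same weighting idea extends to "when" (attention over time segments) and "which" (attention over modalities) by swapping the axis the softmax normalizes over.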
Pages: 279 - 286
Page count: 8