Cross-Modal Attention Network for Temporal Inconsistent Audio-Visual Event Localization

Cited by: 0

Authors:

Xuan, Hanyu [1]

Zhang, Zhenyu [1]

Chen, Shuo [1]

Yang, Jian [1,2]

Yan, Yan [1]

Affiliations:

[1] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Minist Educ, PCA Lab, Key Lab Intelligent Percept & Syst & Iigh, Nanjing, Peoples R China

[2] Jiangsu Key Lab Image & Video Understanding Socia, Nanjing, Peoples R China

Source:

THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE | 2020 / Vol. 34


DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In human multi-modality perception systems, integrating auditory and visual information brings extensive benefits, because the two modalities provide plentiful supplementary cues for understanding events. Although some recent methods have been proposed for this application, they cannot handle practical conditions with temporal inconsistency. Inspired by the human system, which focuses differently on specific locations, time segments, and media while performing multi-modality perception, we provide an attention-based method to simulate this process. Like the human mechanism, our network can adaptively select "where" to attend, "when" to attend, and "which" to attend for audio-visual event localization. In this way, even with large temporal inconsistency between vision and audio, our network is able to adaptively trade information between different modalities and successfully achieve event localization. Our method achieves state-of-the-art performance on the AVE (Audio-Visual Event) dataset, which was collected in real-life conditions. In addition, we systematically investigate audio-visual event localization tasks. The visualization results also help us better understand how our model works.
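As a rough illustration of the "where to attend" idea mentioned in the abstract, the sketch below weights visual regions by their similarity to an audio feature. It is a generic scaled dot-product attention in plain Python, not the authors' network: the function names `softmax` and `audio_guided_attention`, the feature values, and the dimensions are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def audio_guided_attention(audio_vec, region_vecs):
    """Hypothetical helper: attend over visual regions using an audio query.

    audio_vec   -- audio feature for one time segment (list of floats)
    region_vecs -- one visual feature per spatial region, same dimension
    Returns (weights, attended), where attended is the weighted region sum.
    """
    scale = math.sqrt(len(audio_vec))
    # Similarity of the audio query to each visual region.
    scores = [
        sum(a * r for a, r in zip(audio_vec, region)) / scale
        for region in region_vecs
    ]
    weights = softmax(scores)
    dim = len(region_vecs[0])
    # Attention-weighted combination of the region features.
    attended = [
        sum(w * region[i] for w, region in zip(weights, region_vecs))
        for i in range(dim)
    ]
    return weights, attended


# Toy inputs (made up): three regions; the first aligns best with the audio.
audio = [1.0, 0.0]
regions = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
weights, attended = audio_guided_attention(audio, regions)
```

The same pattern could in principle be applied along the time axis ("when") or across modalities ("which") by changing what plays the role of the query and the candidates; how the paper actually does this is not stated in the abstract.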


Pages: 279-286

Page count: 8

50 items in total

[41] Audio-visual cross-modal concept of familiar persons in dogs (Canis familiaris)
Ogura, Tadatoshi
Izumi, Shoko
Imai, Miku
Nagano, Sakurako
Matsuura, Akihiro
[J]. INTERNATIONAL JOURNAL OF PSYCHOLOGY, 2016, 51 : 261 - 261
[42] Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition
Hu, Yuchen
Li, Ruizhe
Chen, Chen
Zou, Heqing
Zhu, Qiushi
Chng, Eng Siong
[J]. PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 5076 - 5084
[43] Attribute-Guided Cross-Modal Interaction and Enhancement for Audio-Visual Matching
Wang, Jiaxiang
Zheng, Aihua
Yan, Yan
He, Ran
Tang, Jin
[J]. IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, 2024, 19 : 4986 - 4998
[44] Collecting Cross-Modal Presence-Absence Evidence for Weakly-Supervised Audio-Visual Event Perception
Gao, Junyu
Chen, Mengyuan
Xu, Changsheng
[J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 18827 - 18836
[45] Online Cross-Modal Adaptation for Audio-Visual Person Identification With Wearable Cameras
Brutti, Alessio
Cavallaro, Andrea
[J]. IEEE TRANSACTIONS ON HUMAN-MACHINE SYSTEMS, 2017, 47 (01) : 40 - 51
[46] Specialty may be better: A decoupling multi-modal fusion network for Audio-visual event localization
Dou, Jinqiao
Chen, Xi
Wang, Yuehai
[J]. 2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
[47] Complete Cross-triplet Loss in Label Space for Audio-visual Cross-modal Retrieval
Zeng, Donghuo
Wang, Yanan
Wu, Jianming
Ikeda, Kazushi
[J]. 2022 IEEE INTERNATIONAL SYMPOSIUM ON MULTIMEDIA (ISM), 2022, : 1 - 9
[48] Audio-Visual Event Localization by Learning Spatial and Semantic Co-Attention
Xue, Cheng
Zhong, Xionghu
Cai, Minjie
Chen, Hao
Wang, Wenwu
[J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 418 - 429
[49] Hierarchical cross-modal contextual attention network for visual grounding
Xu, Xin
Lv, Gang
Sun, Yining
Hu, Yuxia
Nian, Fudong
[J]. MULTIMEDIA SYSTEMS, 2023, 29 (04) : 2073 - 2083