Cross-Modal Attention Network for Temporal Inconsistent Audio-Visual Event Localization

Cited: 0
Authors
Xuan, Hanyu [1 ]
Zhang, Zhenyu [1 ]
Chen, Shuo [1 ]
Yang, Jian [1 ,2 ]
Yan, Yan [1 ]
Affiliations
[1] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Minist Educ, PCA Lab, Key Lab Intelligent Percept & Syst & High, Nanjing, Peoples R China
[2] Jiangsu Key Lab Image & Video Understanding Socia, Nanjing, Peoples R China
Keywords
DOI
None
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In human multi-modality perception systems, the benefits of integrating auditory and visual information are extensive, as they provide plentiful supplementary cues for understanding events. Although some recent methods have been proposed for this task, they cannot handle practical conditions involving temporal inconsistency. Inspired by the human system, which places different focuses on specific locations, time segments, and media while performing multi-modality perception, we propose an attention-based method to simulate this process. Similar to the human mechanism, our network can adaptively select "where" to attend, "when" to attend, and "which" to attend for audio-visual event localization. In this way, even with large temporal inconsistency between vision and audio, our network is able to adaptively trade information between the modalities and successfully localize events. Our method achieves state-of-the-art performance on the AVE (Audio-Visual Event) dataset, which was collected in real-life conditions. In addition, we systematically investigate audio-visual event localization tasks. The visualization results also help us better understand how our model works.
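The record does not specify the network architecture behind the "where/when/which" attention. As a rough illustration of the general idea, a minimal sketch of audio-guided cross-modal attention over time segments (all function names, shapes, and the dot-product scoring are assumptions, not the authors' actual design) might look like:

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross_modal_attention(visual, audio):
    """Audio-guided attention over visual time segments.

    visual, audio: lists of T feature vectors (each a length-D list).
    Each audio segment scores all visual segments (deciding "when"
    to attend), and the scores are normalized into attention weights
    used to pool the visual features.
    Returns (attended, weights); weights is a T x T row-stochastic matrix.
    """
    d = len(visual[0])
    attended, weights = [], []
    for a in audio:
        # Scaled dot-product scores between this audio segment
        # and every visual segment.
        scores = [dot(a, v) / math.sqrt(d) for v in visual]
        w = softmax(scores)
        # Attention-weighted sum of the visual features.
        ctx = [sum(wi * v[k] for wi, v in zip(w, visual)) for k in range(d)]
        attended.append(ctx)
        weights.append(w)
    return attended, weights

random.seed(0)
T, D = 4, 8  # 4 time segments, 8-dim features (toy sizes)
vis = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
aud = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
attended, w = cross_modal_attention(vis, aud)
```

Because the weights are recomputed per segment, a misaligned audio cue can still pull in visual evidence from a different time step, which is the intuition behind handling temporal inconsistency.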
Pages: 279-286
Page count: 8
Related Papers (50 total)
  • [1] Temporal Cross-Modal Attention for Audio-Visual Event Localization
    Nagasaki, Yoshiki
    Hayashi, Masaki
    Kaneko, Naoshi
    Aoki, Yoshimitsu
    [J]. Seimitsu Kogaku Kaishi/Journal of the Japan Society for Precision Engineering, 2022, 88 (03): : 263 - 268
  • [2] Discriminative Cross-Modality Attention Network for Temporal Inconsistent Audio-Visual Event Localization
    Xuan, Hanyu
    Luo, Lei
    Zhang, Zhenyu
    Yang, Jian
    Yan, Yan
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30 : 7878 - 7888
  • [3] Cross-modal Background Suppression for Audio-Visual Event Localization
    Xia, Yan
    Zhao, Zhou
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 19957 - 19966
  • [4] Audio-Visual Event Localization based on Cross-Modal Interacting Guidance
    Yue, Qiurui
    Wu, Xiaoyu
    Gao, Jiayi
    [J]. 2021 IEEE FOURTH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND KNOWLEDGE ENGINEERING (AIKE 2021), 2021, : 104 - 107
  • [5] Cross-Modal Relation-Aware Networks for Audio-Visual Event Localization
    Xu, Haoming
    Zeng, Runhao
    Wu, Qingyao
    Tan, Mingkui
    Gan, Chuang
    [J]. MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 3893 - 3901
  • [6] Cross-Modal Label Contrastive Learning for Unsupervised Audio-Visual Event Localization
    Bao, Peijun
    Yang, Wenhan
    Ng, Boon Poh
    Er, Meng Hwa
    Kot, Alex C.
    [J]. THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 1, 2023, : 215 - 222
  • [7] Temporal and Cross-modal Attention for Audio-Visual Zero-Shot Learning
    Mercea, Otniel-Bogdan
    Hummel, Thomas
    Koepke, A. Sophia
    Akata, Zeynep
    [J]. COMPUTER VISION, ECCV 2022, PT XX, 2022, 13680 : 488 - 505
  • [8] Audio-visual Speaker Recognition with a Cross-modal Discriminative Network
    Tao, Ruijie
    Das, Rohan Kumar
    Li, Haizhou
    [J]. INTERSPEECH 2020, 2020, : 2242 - 2246
  • [9] Deep Cross-Modal Audio-Visual Generation
    Chen, Lele
    Srivastava, Sudhanshu
    Duan, Zhiyao
    Xu, Chenliang
    [J]. PROCEEDINGS OF THE THEMATIC WORKSHOPS OF ACM MULTIMEDIA 2017 (THEMATIC WORKSHOPS'17), 2017, : 349 - 357
  • [10] Cross-modal prediction in audio-visual communication
    Rao, RR
    Chen, TH
    [J]. 1996 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, CONFERENCE PROCEEDINGS, VOLS 1-6, 1996, : 2056 - 2059