Audio-Visual Event Localization based on Cross-Modal Interacting Guidance

Cited by: 1

Authors:

Yue, Qiurui [1]

Wu, Xiaoyu [1]

Gao, Jiayi [1]

Affiliation:

[1] Commun Univ China, State Key Lab Media Convergence & Commun, Beijing, Peoples R China

Source:

2021 IEEE FOURTH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND KNOWLEDGE ENGINEERING (AIKE 2021) | 2021

Funding:

National Natural Science Foundation of China;

Keywords:

Audio-Visual Event Location; Multi-Modal Interactions; Attention; Deep Learning;

DOI:

10.1109/AIKE52691.2021.00022

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper studies the audio-visual event localization task, which requires a machine to simultaneously locate the start and end times of visual and audio events in unconstrained video and to identify the event category. To address this task, we propose a cross-modal interacting guidance network. Unlike previous work, it models the complex relationships within each modality through an audio-visual interacting guidance mechanism. Specifically, the network consists mainly of the cross-modal relation-aware network, used as the baseline, and the audio-visual interacting guidance module that we add to it. The cross-modal interacting guidance (CMIG) module dynamically adjusts the intra-modal attention of the target modality based on the attention flow of the other modality, which is important for modeling the complex relationships within a modality. Experiments show that our framework achieves state-of-the-art performance in both fully supervised and weakly supervised settings on the Audio-Visual Event Location (AVE) dataset.


Pages: 104-107

Page count: 4

50 items in total

[1] Temporal Cross-Modal Attention for Audio-Visual Event Localization
Nagasaki, Yoshiki
Hayashi, Masaki
Kaneko, Naoshi
Aoki, Yoshimitsu
[J]. Seimitsu Kogaku Kaishi/Journal of the Japan Society for Precision Engineering, 2022, 88 (03): 263 - 268
[2] Cross-modal Background Suppression for Audio-Visual Event Localization
Xia, Yan
Zhao, Zhou
[J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022: 19957 - 19966
[3] Cross-Modal Relation-Aware Networks for Audio-Visual Event Localization
Xu, Haoming
Zeng, Runhao
Wu, Qingyao
Tan, Mingkui
Gan, Chuang
[J]. MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020: 3893 - 3901
[4] Cross-Modal Label Contrastive Learning for Unsupervised Audio-Visual Event Localization
Bao, Peijun
Yang, Wenhan
Boon Poh Ng
Er, Meng Hwa
Kot, Alex C.
[J]. THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 1, 2023: 215 - 222
[5] Cross-Modal Attention Network for Temporal Inconsistent Audio-Visual Event Localization
Xuan, Hanyu
Zhang, Zhenyu
Chen, Shuo
Yang, Jian
Yan, Yan
[J]. THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34: 279 - 286
[6] Deep Cross-Modal Audio-Visual Generation
Chen, Lele
Srivastava, Sudhanshu
Duan, Zhiyao
Xu, Chenliang
[J]. PROCEEDINGS OF THE THEMATIC WORKSHOPS OF ACM MULTIMEDIA 2017 (THEMATIC WORKSHOPS'17), 2017: 349 - 357
[7] Cross-modal prediction in audio-visual communication
Rao, RR
Chen, TH
[J]. 1996 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, CONFERENCE PROCEEDINGS, VOLS 1-6, 1996: 2056 - 2059
[8] Audio-Visual Instance Discrimination with Cross-Modal Agreement
Morgado, Pedro
Vasconcelos, Nuno
Misra, Ishan
[J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021: 12470 - 12481
[9] Cross-Modal Analysis of Audio-Visual Film Montage
Zeppelzauer, Matthias
Mitrovic, Dalibor
Breiteneder, Christian
[J]. 2011 20TH INTERNATIONAL CONFERENCE ON COMPUTER COMMUNICATIONS AND NETWORKS (ICCCN), 2011,
[10] Cross-Modal learning for Audio-Visual Video Parsing
Lamba, Jatin
Abhishek
Akula, Jayaprakash
Dabral, Rishabh
Jyothi, Preethi
Ramakrishnan, Ganesh
[J]. INTERSPEECH 2021, 2021: 1937 - 1941
