Dual Attention Matching for Audio-Visual Event Localization

Cited by: 131

Authors:

Wu, Yu [1,2]

Zhu, Linchao [2]

Yan, Yan [3]

Yang, Yi [2]

Affiliations:

[1] Baidu Research, Beijing, People's Republic of China

[2] University of Technology Sydney, ReLER, Sydney, NSW, Australia

[3] Texas State University, San Marcos, TX, USA

Source:

2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019) | 2019

Keywords:

DOI:

10.1109/ICCV.2019.00639

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

In this paper, we investigate the audio-visual event localization problem. The task is to localize an event in a video that is both visible and audible. Previous methods first divide a video into short segments and then fuse visual and acoustic features at the segment level. Because these segments are short, the visual and acoustic features of a given segment may not be well aligned, and direct concatenation of the two features at the segment level is vulnerable to even a minor temporal misalignment between the two signals. We propose a Dual Attention Matching (DAM) module that covers a longer video duration for better modeling of high-level event information, while local temporal information is obtained through a global cross-check mechanism. Our premise is that one should watch the whole video to understand the high-level event, while shorter segments should be checked in detail for localization. Specifically, the global feature of one modality queries the local features of the other modality in a bi-directional way. With the temporal co-occurrence between auditory and visual signals encoded, DAM can be readily applied to various audio-visual event localization tasks, e.g., cross-modality localization and supervised event localization. Experiments on the AVE dataset show that our method outperforms the state of the art by a large margin.
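The bi-directional global-to-local query described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the global feature is built or how the query is scored, so the choices below are assumptions made only for illustration: mean pooling for the global feature, a dot product followed by a sigmoid for the score, and a simple average to combine the two directions. The function names (`mean_pool`, `dual_attention_matching`) are hypothetical and are not taken from the paper.

```python
import math


def mean_pool(segments):
    """Average equal-length segment vectors into one global vector (assumed pooling)."""
    n = len(segments)
    return [sum(column) / n for column in zip(*segments)]


def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def sigmoid(z):
    """Logistic function; adequate for the small toy values used here."""
    return 1.0 / (1.0 + math.exp(-z))


def dual_attention_matching(audio_segments, visual_segments):
    """Score each segment by cross-checking it against the other modality.

    Assumptions (not stated in the abstract): global feature = mean of the
    segment features, score = sigmoid(dot product), and the two directions
    are combined by averaging.
    """
    global_audio = mean_pool(audio_segments)
    global_visual = mean_pool(visual_segments)
    # The global visual feature queries each local audio segment ...
    audio_scores = [sigmoid(dot(global_visual, a)) for a in audio_segments]
    # ... and the global audio feature queries each local visual segment.
    visual_scores = [sigmoid(dot(global_audio, v)) for v in visual_segments]
    return [(pa + pv) / 2 for pa, pv in zip(audio_scores, visual_scores)]


# Toy input: three segments with 2-D features; the last segment disagrees
# with the overall event direction in both modalities.
audio = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
visual = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
scores = dual_attention_matching(audio, visual)
```

In this toy input the first two segments agree with the video-level feature and score above 0.5, while the third scores below 0.5, which is the per-segment relevance signal a localization step could then threshold.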

Pages: 6301-6309

Page count: 9
