Dual Attention Matching for Audio-Visual Event Localization

Cited by: 131
Authors
Wu, Yu [1, 2]
Zhu, Linchao [2]
Yan, Yan [3]
Yang, Yi [2]
Affiliations
[1] Baidu Res, Beijing, Peoples R China
[2] Univ Technol Sydney, ReLER, Sydney, NSW, Australia
[3] Texas State Univ, San Marcos, TX USA
DOI
10.1109/ICCV.2019.00639
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we investigate the audio-visual event localization problem. This task is to localize a visible and audible event in a video. Previous methods first divide a video into short segments, and then fuse visual and acoustic features at the segment level. The duration of these segments is usually short, so the visual and acoustic features of each segment may not be well aligned. Direct concatenation of the two features at the segment level is therefore vulnerable to minor temporal misalignment of the two signals. We propose a Dual Attention Matching (DAM) module to cover a longer video duration for better high-level event information modeling, while the local temporal information is attained by the global cross-check mechanism. Our premise is that one should watch the whole video to understand the high-level event, while shorter segments should be checked in detail for localization. Specifically, the global feature of one modality queries the local feature in the other modality in a bi-directional way. With temporal co-occurrence encoded between auditory and visual signals, DAM can be readily applied in various audio-visual event localization tasks, e.g., cross-modality localization and supervised event localization. Experiments on the AVE dataset show that our method outperforms the state-of-the-art by a large margin.
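The bi-directional query described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the mean pooling used to form the global feature, the dot-product-plus-sigmoid matching score, the averaging of the two directions, and all function names are assumptions chosen for illustration, and the toy features are synthetic.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_pool(segments):
    """Global feature: average of per-segment features (an assumed choice)."""
    count = len(segments)
    return [sum(column) / count for column in zip(*segments)]


def dual_attention_matching(visual, audio):
    """Score each segment by cross-checking it against the other modality.

    visual, audio: lists of T feature vectors of equal dimension.
    Returns a list of T scores in (0, 1).
    """
    global_visual = mean_pool(visual)
    global_audio = mean_pool(audio)
    # The global audio feature queries each local visual feature ...
    visual_scores = [sigmoid(dot(global_audio, v)) for v in visual]
    # ... and the global visual feature queries each local audio feature.
    audio_scores = [sigmoid(dot(global_visual, a)) for a in audio]
    # Assumed fusion: average the two directions.
    return [0.5 * (pv + pa) for pv, pa in zip(visual_scores, audio_scores)]


# Toy data: 10 segments of 8-dim noise; segments 3..6 share an "event" pattern.
rng = random.Random(0)
num_segments, dim = 10, 8
event = [1.0] * dim


def noisy_segment(has_event):
    base = event if has_event else [0.0] * dim
    return [b + rng.gauss(0.0, 0.1) for b in base]


visual = [noisy_segment(3 <= t < 7) for t in range(num_segments)]
audio = [noisy_segment(3 <= t < 7) for t in range(num_segments)]
scores = dual_attention_matching(visual, audio)
```

In this toy setting the segments where both modalities carry the shared pattern receive clearly higher scores than the background segments, which is the kind of temporal co-occurrence signal the abstract attributes to DAM.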
Pages: 6301-6309
Number of pages: 9