Audio-Visual Spatial Integration and Recursive Attention for Robust Sound Source Localization

被引:0
|
作者
Um, Sung Jin [1 ]
Kim, Dongjin [1 ]
Kim, Jung Uk [1 ]
机构
[1] Kyung Hee Univ, Yongin, South Korea
基金
新加坡国家研究基金会;
关键词
Sound source localization; audio-visual spatial integration; recursive attention; multimodal learning; AIDED VISUAL-SEARCH; NETWORK;
D O I
10.1145/3581783.3611722
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
The objective of the sound source localization task is to enable machines to detect the location of sound-making objects within a visual scene. While the audio modality provides spatial cues to locate the sound source, existing approaches only use audio as an auxiliary role to compare spatial regions of the visual modality. Humans, on the other hand, utilize both audio and visual modalities as spatial cues to locate sound sources. In this paper, we propose an audio-visual spatial integration network that integrates spatial cues from both modalities to mimic human behavior when detecting sound-making objects. Additionally, we introduce a recursive attention network to mimic human behavior of iterative focusing on objects, resulting in more accurate attention regions. To effectively encode spatial information from both modalities, we propose audio-visual pair matching loss and spatial region alignment loss. By utilizing the spatial cues of audio-visual modalities and recursively focusing objects, our method can perform more robust sound source localization. Comprehensive experimental results on the Flickr SoundNet and VGG-Sound Source datasets demonstrate the superiority of our proposed method over existing approaches. Our code is available at: https://github.com/VisualAIKHU/SIRA-SSL.
引用
收藏
页码:3507 / 3516
页数:10
相关论文
共 50 条
  • [1] Audio-Visual Fusion for Sound Source Localization and Improved Attention
    Lee, Byoung-gi
    Choi, JongSuk
    Yoon, SangSuk
    Choi, Mun-Taek
    Kim, Munsang
    Kim, Daijin
    [J]. TRANSACTIONS OF THE KOREAN SOCIETY OF MECHANICAL ENGINEERS A, 2011, 35 (07) : 737 - 743
  • [2] AUDIO-VISUAL DISCREPANCY AND THE INFLUENCE ON VERTICAL SOUND SOURCE LOCALIZATION
    Werner, Stephan
    Liebetrau, Judith
    Sporer, Thomas
    [J]. 2012 Fourth International Workshop on Quality of Multimedia Experience (QoMEX), 2012, : 133 - 139
  • [3] Real-time sound source localization and separation based on active audio-visual integration
    Okuno, HG
    Nakadai, K
    [J]. COMPUTATIONAL METHODS IN NEURAL MODELING, PT 1, 2003, 2686 : 118 - 125
  • [4] Self-supervised Neural Audio-Visual Sound Source Localization via Probabilistic Spatial Modeling
    Masuyama, Yoshiki
    Bando, Yoshiaki
    Yatabe, Kohei
    Sasaki, Yoko
    Onishi, Masaki
    Oikawa, Yasuhiro
    [J]. 2020 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2020, : 4848 - 4854
  • [5] Audio-visual integration during overt visual attention
    Quigley, Cliodhna
    Onat, Selim
    Harding, Sue
    Cooke, Martin
    Koenig, Peter
    [J]. JOURNAL OF EYE MOVEMENT RESEARCH, 2007, 1 (02):
  • [6] Audio-Visual Event Localization via Recursive Fusion by Joint Co-Attention
    Duan, Bin
    Tang, Hao
    Wang, Wei
    Zong, Ziliang
    Yang, Guowei
    Yan, Yan
    [J]. 2021 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WACV 2021, 2021, : 4012 - 4021
  • [7] Audio-Visual Event Localization by Learning Spatial and Semantic Co-Attention
    Xue, Cheng
    Zhong, Xionghu
    Cai, Minjie
    Chen, Hao
    Wang, Wenwu
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 418 - 429
  • [8] Audio-visual spatial alignment improves integration in the presence of a competing audio-visual stimulus
    Fleming, Justin T.
    Noyce, Abigail L.
    Shinn-Cunningham, Barbara G.
    [J]. NEUROPSYCHOLOGIA, 2020, 146
  • [9] Robust Audio-Visual Contrastive Learning for Proposal-Based Self-Supervised Sound Source Localization in Videos
    Xuan, Hanyu
    Wu, Zhiliang
    Yang, Jian
    Jiang, Bo
    Luo, Lei
    Alameda-Pineda, Xavier
    Yan, Yan
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (07) : 4896 - 4907
  • [10] Dual Attention Matching for Audio-Visual Event Localization
    Wu, Yu
    Zhu, Linchao
    Yan, Yan
    Yang, Yi
    [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 6301 - 6309