Audio-Visual Spatial Integration and Recursive Attention for Robust Sound Source Localization

Cited by: 0

Authors:

Um, Sung Jin [1]

Kim, Dongjin [1]

Kim, Jung Uk [1]

Affiliations:

[1] Kyung Hee Univ, Yongin, South Korea

Source:

PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023

Funding:

National Research Foundation, Singapore;

Keywords:

Sound source localization; audio-visual spatial integration; recursive attention; multimodal learning; AIDED VISUAL-SEARCH; NETWORK;

DOI:

10.1145/3581783.3611722

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The objective of the sound source localization task is to enable machines to detect the location of sound-making objects within a visual scene. While the audio modality provides spatial cues to locate the sound source, existing approaches use audio only in an auxiliary role, to compare spatial regions of the visual modality. Humans, on the other hand, utilize both audio and visual modalities as spatial cues to locate sound sources. In this paper, we propose an audio-visual spatial integration network that integrates spatial cues from both modalities to mimic human behavior when detecting sound-making objects. Additionally, we introduce a recursive attention network to mimic the human behavior of iteratively focusing on objects, resulting in more accurate attention regions. To effectively encode spatial information from both modalities, we propose an audio-visual pair matching loss and a spatial region alignment loss. By utilizing the spatial cues of both audio and visual modalities and recursively focusing on objects, our method can perform more robust sound source localization. Comprehensive experimental results on the Flickr SoundNet and VGG-Sound Source datasets demonstrate the superiority of our proposed method over existing approaches. Our code is available at: https://github.com/VisualAIKHU/SIRA-SSL.

Citation

Pages: 3507-3516

Number of pages: 10

