Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization

Cited by: 0

Authors:

Liu, Tianyu [1,2]

Zhang, Peng [1,2]

Huang, Wei [3]

Zha, Yufei [1,2]

You, Tao [1]

Zhang, Yanning [1]

Affiliations:

[1] Northwestern Polytech Univ, Xian, Shaanxi, Peoples R China

[2] Northwestern Polytech Univ, Ningbo Inst, Ningbo, Zhejiang, Peoples R China

[3] Nanchang Univ, Nanchang, Jiangxi, Peoples R China

Source:

PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023

Funding:

National Natural Science Foundation of China;

Keywords:

audio-visual; sound source localization; contrastive learning; modality gap; EYE;

DOI:

10.1145/3581783.3612502

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Self-supervised sound source localization is often hampered by inconsistency between the audio and visual modalities. Recent contrastive-learning strategies have shown promise in establishing a consistent correspondence between audio and sound sources in visual scenes. However, they pay insufficient attention to the heterogeneity of features across modalities, which limits further improvement and motivates our work. In this study, we propose an Induction Network to bridge the modality gap more effectively. By decoupling the gradients of the visual and audio modalities, the network learns discriminative visual representations of sound sources with a designed Induction Vector in a bootstrap manner, which also enables the audio modality to be aligned consistently with the visual modality. In addition to a visual weighted contrastive loss, an adaptive threshold selection strategy is introduced to enhance the robustness of the Induction Network. Extensive experiments on the SoundNet-Flickr and VGG-Sound Source datasets demonstrate superior performance over other state-of-the-art methods in various challenging scenarios. The code is available at https://github.com/Tahy1/AVIN.

Pages: 4042-4052

Page count: 11
