Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization

Cited by: 0

Authors:

Liu, Tianyu [1,2]

Zhang, Peng [1,2]

Huang, Wei [3]

Zha, Yufei [1,2]

You, Tao [1]

Zhang, Yanning [1]

Affiliations:

[1] Northwestern Polytech Univ, Xian, Shaanxi, Peoples R China

[2] Northwestern Polytech Univ, Ningbo Inst, Ningbo, Zhejiang, Peoples R China

[3] Nanchang Univ, Nanchang, Jiangxi, Peoples R China

Source:

PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023

Funding:

National Natural Science Foundation of China

Keywords:

audio-visual; sound source localization; contrastive learning; modality gap; EYE

DOI:

10.1145/3581783.3612502

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Self-supervised sound source localization is often hampered by inconsistency between modalities. Recent contrastive-learning-based strategies have shown promise in establishing a consistent correspondence between audio and sound sources in visual scenes. However, insufficient attention to the heterogeneity of features across modalities still limits further improvement of this scheme, which motivates our work. In this study, an Induction Network is proposed to bridge the modality gap more effectively. By decoupling the gradients of the visual and audio modalities, discriminative visual representations of sound sources can be learned with the designed Induction Vector in a bootstrap manner, which also enables the audio modality to be aligned consistently with the visual modality. In addition to a visual weighted contrastive loss, an adaptive threshold selection strategy is introduced to enhance the robustness of the Induction Network. Extensive experiments on the SoundNet-Flickr and VGG-Sound Source datasets demonstrate superior performance compared with other state-of-the-art methods in different challenging scenarios. The code is available at https://github.com/Tahy1/AVIN.

Citation

Pages: 4042-4052

Page count: 11

