Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization

Cited by: 0
Authors
Liu, Tianyu [1, 2]
Zhang, Peng [1, 2]
Huang, Wei [3]
Zha, Yufei [1, 2]
You, Tao [1]
Zhang, Yanning [1]
Affiliations
[1] Northwestern Polytech Univ, Xian, Shaanxi, Peoples R China
[2] Northwestern Polytech Univ, Ningbo Inst, Ningbo, Zhejiang, Peoples R China
[3] Nanchang Univ, Nanchang, Jiangxi, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
audio-visual; sound source localization; contrastive learning; modality gap; EYE;
DOI
10.1145/3581783.3612502
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104; 0812; 0835; 1405;
Abstract
Self-supervised sound source localization is usually challenged by modality inconsistency. In recent studies, contrastive-learning-based strategies have shown promise in establishing a consistent correspondence between audio and sound sources in visual scenes. However, insufficient attention to the heterogeneity of the features from different modalities still limits further improvement of this scheme, which motivates our work. In this study, an Induction Network is proposed to bridge the modality gap more effectively. By decoupling the gradients of the visual and audio modalities, discriminative visual representations of sound sources can be learned with the designed Induction Vector in a bootstrap manner, which also enables the audio modality to be aligned consistently with the visual modality. In addition to a visual weighted contrastive loss, an adaptive threshold selection strategy is introduced to enhance the robustness of the Induction Network. Extensive experiments conducted on the SoundNet-Flickr and VGG-Sound Source datasets demonstrate superior performance compared to other state-of-the-art works in different challenging scenarios. The code is available at https://github.com/Tahy1/AVIN.
Pages: 4042-4052
Page count: 11
Related papers
50 in total
  • [1] Self-supervised Neural Audio-Visual Sound Source Localization via Probabilistic Spatial Modeling
    Masuyama, Yoshiki
    Bando, Yoshiaki
    Yatabe, Kohei
    Sasaki, Yoko
    Onishi, Masaki
    Oikawa, Yasuhiro
    2020 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2020, : 4848 - 4854
  • [2] Learning Self-supervised Audio-Visual Representations for Sound Recommendations
    Krishnamurthy, Sudha
    ADVANCES IN VISUAL COMPUTING (ISVC 2021), PT II, 2021, 13018 : 124 - 138
  • [3] Robust Audio-Visual Contrastive Learning for Proposal-Based Self-Supervised Sound Source Localization in Videos
    Xuan, Hanyu
    Wu, Zhiliang
    Yang, Jian
    Jiang, Bo
    Luo, Lei
    Alameda-Pineda, Xavier
    Yan, Yan
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (07) : 4896 - 4907
  • [4] DOA-Aware Audio-Visual Self-Supervised Learning for Sound Event Localization and Detection
    Fujita, Yoto
    Bando, Yoshiaki
    Imoto, Keisuke
    Onishi, Masaki
    Yoshii, Kazuyoshi
    2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 2061 - 2067
  • [5] Self-Supervised Audio-Visual Soundscape Stylization
    Li, Tingle
    Wang, Renhao
    Huang, Po-Yao
    Owens, Andrew
    Anumanchipalli, Gopala
    COMPUTER VISION - ECCV 2024, PT LXXX, 2025, 15138 : 20 - 40
  • [6] Robust Self-Supervised Audio-Visual Speech Recognition
    Shi, Bowen
    Hsu, Wei-Ning
    Mohamed, Abdelrahman
    INTERSPEECH 2022, 2022, : 2118 - 2122
  • [7] SELF-SUPERVISED AUDIO-VISUAL CO-SEGMENTATION
    Rouditchenko, Andrew
    Zhao, Hang
    Gan, Chuang
    McDermott, Josh
    Torralba, Antonio
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 2357 - 2361
  • [8] SELF-SUPERVISED LEARNING FOR AUDIO-VISUAL SPEAKER DIARIZATION
    Ding, Yifan
    Xu, Yong
    Zhang, Shi-Xiong
    Cong, Yahuan
    Wang, Liqiang
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 4367 - 4371
  • [9] Audio-visual self-supervised representation learning: A survey
    Alsuwat, Manal
    Al-Shareef, Sarah
    Alghamdi, Manal
    NEUROCOMPUTING, 2025, 634
  • [10] Self-Supervised Video Representation and Temporally Adaptive Attention for Audio-Visual Event Localization
    Ran, Yue
    Tang, Hongying
    Li, Baoqing
    Wang, Guohui
    APPLIED SCIENCES-BASEL, 2022, 12 (24):