Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization

Times cited: 0
Authors
Liu, Tianyu [1, 2]
Zhang, Peng [1, 2]
Huang, Wei [3]
Zha, Yufei [1, 2]
You, Tao [1]
Zhang, Yanning [1]
Affiliations
[1] Northwestern Polytech Univ, Xian, Shaanxi, Peoples R China
[2] Northwestern Polytech Univ, Ningbo Inst, Ningbo, Zhejiang, Peoples R China
[3] Nanchang Univ, Nanchang, Jiangxi, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
audio-visual; sound source localization; contrastive learning; modality gap; EYE;
DOI
10.1145/3581783.3612502
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Self-supervised sound source localization is usually challenged by modality inconsistency. In recent studies, contrastive-learning-based strategies have shown promise in establishing a consistent correspondence between audio and sound sources in visual scenarios. Unfortunately, insufficient attention to the influence of heterogeneity between the features of the different modalities still limits further improvement of this scheme, which motivates our work. In this study, an Induction Network is proposed to bridge the modality gap more effectively. By decoupling the gradients of the visual and audio modalities, discriminative visual representations of sound sources can be learned with the designed Induction Vector in a bootstrap manner, which also enables the audio modality to be aligned consistently with the visual modality. In addition to a visual weighted contrastive loss, an adaptive threshold selection strategy is introduced to enhance the robustness of the Induction Network. Extensive experiments conducted on the SoundNet-Flickr and VGG-Sound Source datasets demonstrate superior performance compared to other state-of-the-art works in different challenging scenarios. The code is available at https://github.com/Tahy1/AVIN.
Pages: 4042-4052
Number of pages: 11
Related papers
50 in total
  • [41] AUDIO-VISUAL SPEECH ENHANCEMENT AND SEPARATION BY UTILIZING MULTI-MODAL SELF-SUPERVISED EMBEDDINGS
    Chern, I-Chun
    Hung, Kuo-Hsuan
    Chen, Yi-Ting
    Hussain, Tassadaq
    Gogate, Mandar
    Hussain, Amir
    Tsao, Yu
    Hou, Jen-Cheng
    2023 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW, 2023,
  • [42] A Proposal-based Paradigm for Self-supervised Sound Source Localization in Videos
    Xuan, Hanyu
    Wu, Zhiliang
    Yang, Jian
    Yan, Yan
    Alameda-Pineda, Xavier
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 1019 - 1028
  • [43] Self-Supervised Incremental Learning for Sound Source Localization in Complex Indoor Environment
    Liu, Hangxin
    Zhang, Zeyu
    Zhu, Yixin
    Zhu, Song-Chun
    2019 INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), 2019, : 2599 - 2605
  • [44] Discriminative Cross-Modality Attention Network for Temporal Inconsistent Audio-Visual Event Localization
    Xuan, Hanyu
    Luo, Lei
    Zhang, Zhenyu
    Yang, Jian
    Yan, Yan
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30 : 7878 - 7888
  • [45] DUAL-MODALITY SEQ2SEQ NETWORK FOR AUDIO-VISUAL EVENT LOCALIZATION
    Lin, Yan-Bo
    Li, Yu-Jhe
    Wang, Yu-Chiang Frank
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 2002 - 2006
  • [46] Audio-visual based non-line-of-sight sound source localization: A feasibility study
    King, E. A.
    Tatoglu, A.
    Iglesias, D.
    Matriss, A.
    APPLIED ACOUSTICS, 2021, 171
  • [47] Real-time sound source localization and separation based on active audio-visual integration
    Okuno, HG
    Nakadai, K
    COMPUTATIONAL METHODS IN NEURAL MODELING, PT 1, 2003, 2686 : 118 - 125
  • [48] Audio-Visual Sound Source Localization and Tracking Based on Mobile Robot for The Cocktail Party Problem
    Shi, Zhanbo
    Zhang, Lin
    Wang, Dongqing
    APPLIED SCIENCES-BASEL, 2023, 13 (10):
  • [49] Single-modal Incremental Terrain Clustering from Self-Supervised Audio-Visual Feature Learning
    Ishikawa, Reina
    Hachiuma, Ryo
    Kurobe, Akiyoshi
    Saito, Hideo
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 9399 - 9406
  • [50] Self-Supervised Audio-Visual Feature Learning for Single-Modal Incremental Terrain Type Clustering
    Ishikawa, Reina
    Hachiuma, Ryo
    Saito, Hideo
    IEEE ACCESS, 2021, 9 : 64346 - 64357