Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization

Citations: 0
Authors
Liu, Tianyu [1 ,2 ]
Zhang, Peng [1 ,2 ]
Huang, Wei [3 ]
Zha, Yufei [1 ,2 ]
You, Tao [1 ]
Zhang, Yanning [1 ]
Affiliations
[1] Northwestern Polytech Univ, Xi'an, Shaanxi, Peoples R China
[2] Northwestern Polytech Univ, Ningbo Inst, Ningbo, Zhejiang, Peoples R China
[3] Nanchang Univ, Nanchang, Jiangxi, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
audio-visual; sound source localization; contrastive learning; modality gap; EYE;
DOI
10.1145/3581783.3612502
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Self-supervised sound source localization is often hampered by inconsistency between modalities. Recent contrastive-learning-based strategies have shown promise in establishing a consistent correspondence between audio and the sound sources in visual scenes. However, insufficient attention to the heterogeneity of features across modalities still limits further improvement of this scheme, which motivates our work. In this study, an Induction Network is proposed to bridge the modality gap more effectively. By decoupling the gradients of the visual and audio modalities, discriminative visual representations of sound sources can be learned with the designed Induction Vector in a bootstrap manner, which also enables the audio modality to be aligned consistently with the visual modality. In addition to a visual weighted contrastive loss, an adaptive threshold selection strategy is introduced to enhance the robustness of the Induction Network. Extensive experiments on the SoundNet-Flickr and VGG-Sound Source datasets demonstrate superior performance compared with other state-of-the-art methods in different challenging scenarios. The code is available at https://github.com/Tahy1/AVIN.
Pages: 4042-4052
Number of pages: 11
Related papers
50 in total
  • [31] Audio-Visual Self-Supervised Terrain Type Recognition for Ground Mobile Platforms
    Kurobe, Akiyoshi
    Nakajima, Yoshikatsu
    Kitani, Kris
    Saito, Hideo
    IEEE ACCESS, 2021, 9 : 29970 - 29979
  • [32] Hear The Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization
    Fedorishin, Dennis
    Mohan, Deen Dayal
    Jawade, Bhavin
    Setlur, Srirangaraj
    Govindaraju, Venu
    2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023, : 2277 - 2286
  • [33] BI-DIRECTIONAL MODALITY FUSION NETWORK FOR AUDIO-VISUAL EVENT LOCALIZATION
    Liu, Shuo
    Quan, Weize
    Liu, Yuan
    Yan, Dong-Ming
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4868 - 4872
  • [34] Audio-Visual Spatial Integration and Recursive Attention for Robust Sound Source Localization
    Um, Sung Jin
    Kim, Dongjin
    Kim, Jung Uk
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 3507 - 3516
  • [35] HiCMAE: Hierarchical Contrastive Masked Autoencoder for self-supervised Audio-Visual Emotion Recognition
    Sun, Licai
    Lian, Zheng
    Liu, Bin
    Tao, Jianhua
    Information Fusion, 2024, 108
  • [36] AV-PedAware: Self-Supervised Audio-Visual Fusion for Dynamic Pedestrian Awareness
    Yang, Yizhuo
    Yuan, Shenghai
    Cao, Muqing
    Yang, Jianfei
    Xie, Lihua
    2023 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS, IROS, 2023, : 1871 - 1877
  • [38] Noise-Tolerant Self-Supervised Learning for Audio-Visual Voice Activity Detection
    Kim, Ui-Hyun
    INTERSPEECH 2021, 2021, : 326 - 330
  • [39] Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity
    Sarkar, Pritam
    Etemad, Ali
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 8, 2023, : 9723 - 9732
  • [40] Audio-visual sensing from a quadcopter: dataset and baselines for source localization and sound enhancement
    Wang, Lin
    Sanchez-Matilla, Ricardo
    Cavallaro, Andrea
    2019 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2019, : 5320 - 5325