Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization

被引:0
|
作者
Liu, Tianyu [1 ,2 ]
Zhang, Peng [1 ,2 ]
Huang, Wei [3 ]
Zha, Yufei [1 ,2 ]
You, Tao [1 ]
Zhang, Yanning [1 ]
机构
[1] Northwestern Polytech Univ, Xian, Shanxi, Peoples R China
[2] Northwestern Polytech Univ, Ningbo Inst, Ningbo, Zhejiang, Peoples R China
[3] Nanchang Univ, Nanchang, Jiangxi, Peoples R China
基金
中国国家自然科学基金;
关键词
audio-visual; sound source localization; contrastive learning; modality gap; EYE;
D O I
10.1145/3581783.3612502
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Self-supervised sound source localization is usually challenged by the modality inconsistency. In recent studies, contrastive learning based strategies have shown promising to establish such a consistent correspondence between audio and sound sources in visual scenarios. Unfortunately, the insufficient attention to the heterogeneity influence in the different modality features still limits this scheme to be further improved, which also becomes the motivation of our work. In this study, an Induction Network is proposed to bridge the modality gap more effectively. By decoupling the gradients of visual and audio modalities, the discriminative visual representations of sound sources can be learned with the designed Induction Vector in a bootstrap manner, which also enables the audio modality to be aligned with the visual modality consistently. In addition to a visual weighted contrastive loss, an adaptive threshold selection strategy is introduced to enhance the robustness of the Induction Network. Substantial experiments conducted on SoundNet-Flickr and VGG-Sound Source datasets have demonstrated a superior performance compared to other state-of-the-art works in different challenging scenarios. The code is available at https://github.com/Tahy1/AVIN.
引用
收藏
页码:4042 / 4052
页数:11
相关论文
共 50 条
  • [21] Self-Supervised Moving Vehicle Detection From Audio-Visual Cues
    Zuern, Jannik
    Burgard, Wolfram
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2022, 7 (03) : 7415 - 7422
  • [22] Self-Supervised Audio-Visual Representation Learning for in-the-wild Videos
    Feng, Zishun
    Tu, Ming
    Xia, Rui
    Wang, Yuxuan
    Krishnamurthy, Ashok
    2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2020, : 5671 - 5672
  • [23] Comparing Learning Methodologies for Self-Supervised Audio-Visual Representation Learning
    Terbouche, Hacene
    Schoneveld, Liam
    Benson, Oisin
    Othmani, Alice
    IEEE ACCESS, 2022, 10 : 41622 - 41638
  • [24] Self-Supervised Learning for Audio-Visual Relationships of Videos With Stereo Sounds
    Sato, Tomoya
    Sugano, Yusuke
    Sato, Yoichi
    IEEE ACCESS, 2022, 10 : 94273 - 94284
  • [25] Audio-Visual Grouping Network for Sound Localization from Mixtures
    Mo, Shentong
    Tian, Yapeng
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 10565 - 10574
  • [26] A Closer Look at Weakly-Supervised Audio-Visual Source Localization
    Mo, Shentong
    Morgado, Pedro
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [27] Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking
    Li, Yidi
    Liu, Hong
    Tang, Hao
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 1456 - 1463
  • [28] Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning
    Cheng, Ying
    Wang, Ruize
    Pan, Zhihao
    Feng, Rui
    Zhang, Yuejie
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 3884 - 3892
  • [29] Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition
    Pan, Xichen
    Chen, Peiyu
    Gong, Yichen
    Zhou, Helong
    Wang, Xinbing
    Lin, Zhouhan
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 4491 - 4503
  • [30] Self-Supervised Learning of Audio Representations From Audio-Visual Data Using Spatial Alignment
    Wang, Shanshan
    Politis, Archontis
    Mesaros, Annamaria
    Virtanen, Tuomas
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2022, 16 (06) : 1467 - 1479