Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization

被引：0

作者：

Liu, Tianyu ^{[1
,2
]}

Zhang, Peng ^{[1
,2
]}

Huang, Wei ^{[3
]}

Zha, Yufei ^{[1
,2
]}

You, Tao ^{[1
]}

Zhang, Yanning ^{[1
]}

机构：

[1] Northwestern Polytech Univ, Xian, Shanxi, Peoples R China

[2] Northwestern Polytech Univ, Ningbo Inst, Ningbo, Zhejiang, Peoples R China

[3] Nanchang Univ, Nanchang, Jiangxi, Peoples R China

来源：

PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023年

基金：

中国国家自然科学基金;

关键词：

audio-visual; sound source localization; contrastive learning; modality gap; EYE;

D O I：

10.1145/3581783.3612502

中图分类号：

TP18 [人工智能理论];

学科分类号：

081104 ; 0812 ; 0835 ; 1405 ;

摘要：

Self-supervised sound source localization is usually challenged by the modality inconsistency. In recent studies, contrastive learning based strategies have shown promising to establish such a consistent correspondence between audio and sound sources in visual scenarios. Unfortunately, the insufficient attention to the heterogeneity influence in the different modality features still limits this scheme to be further improved, which also becomes the motivation of our work. In this study, an Induction Network is proposed to bridge the modality gap more effectively. By decoupling the gradients of visual and audio modalities, the discriminative visual representations of sound sources can be learned with the designed Induction Vector in a bootstrap manner, which also enables the audio modality to be aligned with the visual modality consistently. In addition to a visual weighted contrastive loss, an adaptive threshold selection strategy is introduced to enhance the robustness of the Induction Network. Substantial experiments conducted on SoundNet-Flickr and VGG-Sound Source datasets have demonstrated a superior performance compared to other state-of-the-art works in different challenging scenarios. The code is available at https://github.com/Tahy1/AVIN.

引用

页码：4042 / 4052

页数：11

共 50 条

[21] Self-Supervised Moving Vehicle Detection From Audio-Visual Cues
Zuern, Jannik
Burgard, Wolfram
IEEE ROBOTICS AND AUTOMATION LETTERS, 2022, 7 (03) : 7415 - 7422
[22] Self-Supervised Audio-Visual Representation Learning for in-the-wild Videos
Feng, Zishun
Tu, Ming
Xia, Rui
Wang, Yuxuan
Krishnamurthy, Ashok
2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2020, : 5671 - 5672
[23] Comparing Learning Methodologies for Self-Supervised Audio-Visual Representation Learning
Terbouche, Hacene
Schoneveld, Liam
Benson, Oisin
Othmani, Alice
IEEE ACCESS, 2022, 10 : 41622 - 41638
[24] Self-Supervised Learning for Audio-Visual Relationships of Videos With Stereo Sounds
Sato, Tomoya
Sugano, Yusuke
Sato, Yoichi
IEEE ACCESS, 2022, 10 : 94273 - 94284
[25] Audio-Visual Grouping Network for Sound Localization from Mixtures
Mo, Shentong
Tian, Yapeng
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 10565 - 10574
[26] A Closer Look at Weakly-Supervised Audio-Visual Source Localization
Mo, Shentong
Morgado, Pedro
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
[27] Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking
Li, Yidi
Liu, Hong
Tang, Hao
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 1456 - 1463
[28] Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning
Cheng, Ying
Wang, Ruize
Pan, Zhihao
Feng, Rui
Zhang, Yuejie
MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 3884 - 3892
[29] Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition
Pan, Xichen
Chen, Peiyu
Gong, Yichen
Zhou, Helong
Wang, Xinbing
Lin, Zhouhan
PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 4491 - 4503
[30] Self-Supervised Learning of Audio Representations From Audio-Visual Data Using Spatial Alignment
Wang, Shanshan
Politis, Archontis
Mesaros, Annamaria
Virtanen, Tuomas
IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2022, 16 (06) : 1467 - 1479

← 1 2 3 4 5 →