Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization

Cited by: 0

Authors:

Liu, Tianyu ^{[1,2]}

Zhang, Peng ^{[1,2]}

Huang, Wei ^{[3]}

Zha, Yufei ^{[1,2]}

You, Tao ^{[1]}

Zhang, Yanning ^{[1]}

Affiliations:

[1] Northwestern Polytech Univ, Xian, Shaanxi, Peoples R China

[2] Northwestern Polytech Univ, Ningbo Inst, Ningbo, Zhejiang, Peoples R China

[3] Nanchang Univ, Nanchang, Jiangxi, Peoples R China

Source:

PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023

Funding:

National Natural Science Foundation of China;

Keywords:

audio-visual; sound source localization; contrastive learning; modality gap; EYE;

DOI:

10.1145/3581783.3612502

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Self-supervised sound source localization is usually challenged by modality inconsistency. In recent studies, contrastive learning based strategies have shown promise in establishing a consistent correspondence between audio and sound sources in visual scenarios. Unfortunately, insufficient attention to the influence of heterogeneity across the features of different modalities still limits further improvement of this scheme, which motivates our work. In this study, an Induction Network is proposed to bridge the modality gap more effectively. By decoupling the gradients of the visual and audio modalities, discriminative visual representations of sound sources can be learned with the designed Induction Vector in a bootstrap manner, which also enables the audio modality to be aligned with the visual modality consistently. In addition to a visual weighted contrastive loss, an adaptive threshold selection strategy is introduced to enhance the robustness of the Induction Network. Extensive experiments conducted on the SoundNet-Flickr and VGG-Sound Source datasets demonstrate superior performance compared to other state-of-the-art works in different challenging scenarios. The code is available at https://github.com/Tahy1/AVIN.


Pages: 4042-4052

Page count: 11
