Robust Audio-Visual Contrastive Learning for Proposal-Based Self-Supervised Sound Source Localization in Videos

Authors
Xuan, Hanyu [1]
Wu, Zhiliang [2]
Yang, Jian [3]
Jiang, Bo [4]
Luo, Lei [3]
Alameda-Pineda, Xavier [5]
Yan, Yan [6]
Affiliations
[1] Anhui Univ, Sch Big Data & Stat, Hefei 230601, Anhui, Peoples R China
[2] Zhejiang Univ, CCAI, Hangzhou 310007, Zhejiang, Peoples R China
[3] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Nanjing 210094, Jiangsu, Peoples R China
[4] Anhui Univ, Sch Comp Sci & Technol, Hefei 230601, Anhui, Peoples R China
[5] Univ Grenoble Alpes, INRIA, F-38000 Grenoble, France
[6] IIT, Dept Comp Sci, Chicago, IL 60616 USA
Funding
National Natural Science Foundation of China
Keywords
Task analysis; Semantics; Visualization; Videos; Annotations; Location awareness; Synchronization; Sound source localization; active contrastive set mining; audio-visual contrastive learning; faulty negatives; global response map; proposal-based method;
DOI
10.1109/TPAMI.2024.3363508
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
By observing a scene and listening to the corresponding audio cues, humans can easily tell where a sound comes from. To achieve such cross-modal perception on machines, existing methods take advantage of the maps obtained by interpolation operations to localize the sound source. As semantic object-level localization is more attractive for prospective practical applications, we argue that these map-based methods only offer a coarse-grained and indirect description of the sound source. Additionally, these methods utilize a single audio-visual tuple at a time during self-supervised learning, causing the model to lose the crucial chance to reason about the data distribution of large-scale audio-visual samples. Although the introduction of Audio-Visual Contrastive Learning (AVCL) can effectively alleviate this issue, the contrastive set constructed by random sampling is based on the assumption that the audio and visual segments from all other videos are not semantically related. Since the resulting contrastive set contains a large number of faulty negatives, we argue that this assumption is too coarse. In this paper, we advocate a novel proposal-based solution that directly localizes the semantic object-level sound source, without any manual annotations. The Global Response Map (GRM) is incorporated as an unsupervised spatial constraint to filter out instances corresponding to a large number of sound-unrelated regions. As a result, our proposal-based Sound Source Localization (SSL) can be cast into a simpler Multiple Instance Learning (MIL) problem. To overcome the limitation of random sampling in AVCL, we propose a novel Active Contrastive Set Mining (ACSM) to mine the contrastive sets with informative and diverse negatives for robust AVCL. Our approaches achieve state-of-the-art (SOTA) performance when compared to several baselines on multiple SSL datasets with diverse scenarios.
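The faulty-negative problem described in the abstract can be illustrated with a minimal sketch. It is not the paper's ACSM procedure, which the abstract does not specify. It assumes a standard InfoNCE-style contrastive loss for one audio anchor and swaps in a simple cosine-similarity threshold as a stand-in for negative mining. The names `info_nce` and `faulty_threshold`, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(audio, visuals, pos_index, temperature=0.1, faulty_threshold=None):
    """InfoNCE loss for one audio anchor against candidate visual embeddings.

    Every candidate except pos_index is treated as a negative. If
    faulty_threshold is given, negatives whose cosine similarity to the
    anchor exceeds it are dropped as suspected faulty negatives (an
    illustrative heuristic, not the paper's mining strategy).
    Returns (loss, indices of the candidates kept in the contrastive set).
    """
    a = normalize(audio)
    sims = [dot(a, normalize(v)) for v in visuals]
    kept = [
        i for i, s in enumerate(sims)
        if i == pos_index or faulty_threshold is None or s <= faulty_threshold
    ]
    logits = [sims[i] / temperature for i in kept]
    # Numerically stable log-sum-exp over the kept candidates.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - sims[pos_index] / temperature, kept


# Toy embeddings: index 0 is the true pair, index 1 is a semantically
# similar clip from another video, index 2 is an unrelated clip.
audio = [1.0, 0.0]
visuals = [[0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]

loss_random, kept_random = info_nce(audio, visuals, pos_index=0)
loss_filtered, kept_filtered = info_nce(
    audio, visuals, pos_index=0, faulty_threshold=0.9
)
```

In this toy setting, the near-duplicate at index 1 is treated as a negative under random sampling and inflates the loss. Dropping it leaves only the unrelated clip as a negative.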
Pages: 4896-4907
Number of pages: 12