Robust Audio-Visual Contrastive Learning for Proposal-Based Self-Supervised Sound Source Localization in Videos

Cited by: 0
Authors
Xuan, Hanyu [1 ]
Wu, Zhiliang [2 ]
Yang, Jian [3 ]
Jiang, Bo [4 ]
Luo, Lei [3 ]
Alameda-Pineda, Xavier [5 ]
Yan, Yan [6 ]
Affiliations
[1] Anhui Univ, Sch Big Data & Stat, Hefei 230601, Anhui, Peoples R China
[2] Zhejiang Univ, CCAI, Hangzhou 310007, Zhejiang, Peoples R China
[3] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Nanjing 210094, Jiangsu, Peoples R China
[4] Anhui Univ, Sch Comp Sci & Technol, Hefei 230601, Anhui, Peoples R China
[5] Univ Grenoble Alpes, INRIA, F-38000 Grenoble, France
[6] IIT, Dept Comp Sci, Chicago, IL 60616 USA
Funding
National Natural Science Foundation of China;
Keywords
Task analysis; Semantics; Visualization; Videos; Annotations; Location awareness; Synchronization; Sound source localization; active contrastive set mining; audio-visual contrastive learning; faulty negatives; global response map; proposal-based method;
DOI
10.1109/TPAMI.2024.3363508
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
By observing a scene and listening to the corresponding audio cues, humans can easily recognize where a sound comes from. To achieve such cross-modal perception on machines, existing methods localize the sound source using maps obtained by interpolation operations. Since semantic object-level localization is more attractive for prospective practical applications, we argue that these map-based methods offer only a coarse-grained and indirect description of the sound source. Moreover, these methods use a single audio-visual tuple at a time during self-supervised learning, so the model loses the crucial opportunity to reason about the data distribution of large-scale audio-visual samples. Although introducing Audio-Visual Contrastive Learning (AVCL) can effectively alleviate this issue, the contrastive set constructed by random sampling rests on the assumption that the audio and visual segments from all other videos are semantically unrelated. We argue that this assumption is too coarse, since the resulting contrastive set contains a large number of faulty negatives. In this paper, we advocate a novel proposal-based solution that directly localizes the semantic object-level sound source without any manual annotations. A Global Response Map (GRM) is incorporated as an unsupervised spatial constraint to filter out the instances that correspond to a large number of sound-unrelated regions. As a result, our proposal-based Sound Source Localization (SSL) can be cast as a simpler Multiple Instance Learning (MIL) problem. To overcome the limitation of random sampling in AVCL, we propose a novel Active Contrastive Set Mining (ACSM) scheme that mines contrastive sets with informative and diverse negatives for robust AVCL. Our approaches achieve state-of-the-art (SOTA) performance compared to several baselines on multiple SSL datasets with diverse scenarios.
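The faulty-negative problem described in the abstract can be illustrated with a minimal sketch: a plain InfoNCE-style audio-visual contrastive loss in which suspected faulty negatives (off-diagonal pairs whose cross-modal similarity is already very high, hinting at semantic overlap with the anchor) are masked out of the softmax denominator. This is only a simple threshold-based stand-in for the paper's Active Contrastive Set Mining, not the authors' method; the function name and threshold are hypothetical.

```python
import numpy as np

def avcl_loss_with_negative_filtering(audio, visual, tau=0.07, faulty_thresh=0.8):
    """InfoNCE-style audio-visual contrastive loss that discards likely
    faulty negatives: off-diagonal pairs whose cross-modal cosine
    similarity exceeds `faulty_thresh` (a hypothetical cutoff).

    audio, visual: (N, D) embeddings of N paired audio/visual clips.
    """
    # L2-normalize so dot products become cosine similarities.
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    sim = a @ v.T                          # (N, N) audio-to-visual similarities
    logits = sim / tau

    n = sim.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    # Suspected faulty negatives are removed from the softmax denominator.
    logits = np.where(off_diag & (sim > faulty_thresh), -np.inf, logits)

    # Row-wise log-softmax; the true (paired) sample sits on the diagonal,
    # so the row maximum is always finite and the computation is stable.
    m = logits.max(axis=1, keepdims=True)
    log_probs = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    return float(-np.diag(log_probs).mean())
```

With perfectly aligned modalities the diagonal dominates each row and the loss is near zero; with unrelated embeddings the loss grows, which is what drives the model to pull true audio-visual pairs together.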
Pages: 4896-4907
Number of Pages: 12
Related Papers (50 total)
  • [1] A Proposal-based Paradigm for Self-supervised Sound Source Localization in Videos
    Xuan, Hanyu
    Wu, Zhiliang
    Yang, Jian
    Yan, Yan
    Alameda-Pineda, Xavier
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 1019 - 1028
  • [2] SELF-SUPERVISED CONTRASTIVE LEARNING FOR AUDIO-VISUAL ACTION RECOGNITION
    Liu, Yang
    Tan, Ying
    Lan, Haoyuan
    [J]. 2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 1000 - 1004
  • [3] Learning Self-supervised Audio-Visual Representations for Sound Recommendations
    Krishnamurthy, Sudha
    [J]. ADVANCES IN VISUAL COMPUTING (ISVC 2021), PT II, 2021, 13018 : 124 - 138
  • [4] Self-Supervised Audio-Visual Representation Learning for in-the-wild Videos
    Feng, Zishun
    Tu, Ming
    Xia, Rui
    Wang, Yuxuan
    Krishnamurthy, Ashok
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2020, : 5671 - 5672
  • [5] Self-Supervised Learning for Audio-Visual Relationships of Videos With Stereo Sounds
    Sato, Tomoya
    Sugano, Yusuke
    Sato, Yoichi
    [J]. IEEE ACCESS, 2022, 10 : 94273 - 94284
  • [6] Self-supervised Neural Audio-Visual Sound Source Localization via Probabilistic Spatial Modeling
    Masuyama, Yoshiki
    Bando, Yoshiaki
    Yatabe, Kohei
    Sasaki, Yoko
    Onishi, Masaki
    Oikawa, Yasuhiro
    [J]. 2020 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2020, : 4848 - 4854
  • [7] DOA-Aware Audio-Visual Self-Supervised Learning for Sound Event Localization and Detection
    Fujita, Yoto
    Bando, Yoshiaki
    Imoto, Keisuke
    Onishi, Masaki
    Yoshii, Kazuyoshi
    [J]. 2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 2061 - 2067
  • [8] Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization
    Liu, Tianyu
    Zhang, Peng
    Huang, Wei
    Zha, Yufei
    You, Tao
    Zhang, Yanning
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4042 - 4052
  • [9] Robust Self-Supervised Audio-Visual Speech Recognition
    Shi, Bowen
    Hsu, Wei-Ning
    Mohamed, Abdelrahman
    [J]. INTERSPEECH 2022, 2022, : 2118 - 2122
  • [10] Audio-Visual Class Association Based on Two-stage Self-supervised Contrastive Learning towards Robust Scene Analysis
    Suzuki, Kei
    Itoyama, Katsutoshi
    Nishida, Kenji
    Nakadai, Kazuhiro
    [J]. 2023 IEEE/SICE INTERNATIONAL SYMPOSIUM ON SYSTEM INTEGRATION, SII, 2023,