Robust Audio-Visual Contrastive Learning for Proposal-Based Self-Supervised Sound Source Localization in Videos

Cited by: 0
Authors
Xuan, Hanyu [1 ]
Wu, Zhiliang [2 ]
Yang, Jian [3 ]
Jiang, Bo [4 ]
Luo, Lei [3 ]
Alameda-Pineda, Xavier [5 ]
Yan, Yan [6 ]
Affiliations
[1] Anhui Univ, Sch Big Data & Stat, Hefei 230601, Anhui, Peoples R China
[2] Zhejiang Univ, CCAI, Hangzhou 310007, Zhejiang, Peoples R China
[3] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Nanjing 210094, Jiangsu, Peoples R China
[4] Anhui Univ, Sch Comp Sci & Technol, Hefei 230601, Anhui, Peoples R China
[5] Univ Grenoble Alpes, INRIA, F-38000 Grenoble, France
[6] IIT, Dept Comp Sci, Chicago, IL 60616 USA
Funding
National Natural Science Foundation of China;
Keywords
Task analysis; Semantics; Visualization; Videos; Annotations; Location awareness; Synchronization; Sound source localization; active contrastive set mining; audio-visual contrastive learning; faulty negatives; global response map; proposal-based method;
DOI
10.1109/TPAMI.2024.3363508
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
By observing a scene and listening to the corresponding audio cues, humans can easily recognize where a sound comes from. To achieve such cross-modal perception on machines, existing methods localize the sound source with maps obtained by interpolation operations. As semantic object-level localization is more attractive for prospective practical applications, we argue that these map-based methods offer only a coarse-grained and indirect description of the sound source. Additionally, these methods utilize a single audio-visual tuple at a time during self-supervised learning, causing the model to lose the crucial chance to reason about the data distribution of large-scale audio-visual samples. Although the introduction of Audio-Visual Contrastive Learning (AVCL) can effectively alleviate this issue, a contrastive set constructed by random sampling rests on the assumption that the audio and visual segments from all other videos are semantically unrelated. Because the resulting contrastive set contains a large number of faulty negatives, we argue that this assumption is too coarse. In this paper, we advocate a novel proposal-based solution that directly localizes the semantic object-level sound source, without any manual annotations. The Global Response Map (GRM) is incorporated as an unsupervised spatial constraint to filter out those instances corresponding to a large number of sound-unrelated regions. As a result, our proposal-based Sound Source Localization (SSL) can be cast into a simpler Multiple Instance Learning (MIL) problem. To overcome the limitation of random sampling in AVCL, we propose a novel Active Contrastive Set Mining (ACSM) to mine contrastive sets with informative and diverse negatives for robust AVCL. Our approaches achieve state-of-the-art (SOTA) performance compared to several baselines on multiple SSL datasets with diverse scenarios.
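To make the faulty-negative problem concrete, the sketch below implements a plain audio-visual InfoNCE contrastive loss in which off-diagonal pairs whose cross-modal cosine similarity exceeds a threshold are dropped from the negative set. This is a minimal illustration, not the paper's method: the function name `info_nce_with_mining`, the `faulty_thresh` criterion, and the use of raw cosine similarity as the mining signal are all assumptions for demonstration; the paper's ACSM uses its own mining criterion.

```python
import numpy as np

def info_nce_with_mining(audio, visual, tau=0.07, faulty_thresh=0.9):
    """Audio-to-visual InfoNCE loss with a simple faulty-negative filter.

    audio, visual: (N, D) arrays of paired embeddings (row i of each is a
    positive pair). Hypothetical sketch, not the paper's actual ACSM.
    """
    # L2-normalize so that dot products are cosine similarities
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    sim = a @ v.T / tau  # temperature-scaled pairwise similarities, (N, N)
    n = sim.shape[0]
    losses = []
    for i in range(n):
        cos = a[i] @ v.T                 # unscaled cosine row for mining
        neg_mask = np.ones(n, dtype=bool)
        neg_mask[i] = False              # the diagonal is the positive
        # Drop suspected faulty negatives: off-diagonal pairs almost as
        # similar as a true match are likely semantically related.
        neg_mask &= cos < faulty_thresh
        logits = np.concatenate(([sim[i, i]], sim[i, neg_mask]))
        # -log softmax of the positive against the retained negatives
        losses.append(-logits[0] + np.log(np.exp(logits).sum()))
    return float(np.mean(losses))
```

With `faulty_thresh=1.0` this reduces to standard random-sampling AVCL over the batch; lowering the threshold shrinks the negative set, trading negative diversity against contamination by semantically related pairs.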
Pages: 4896 - 4907
Page count: 12