A Proposal-based Paradigm for Self-supervised Sound Source Localization in Videos

Cited by: 10
Authors
Xuan, Hanyu [1]
Wu, Zhiliang [1]
Yang, Jian [1]
Yan, Yan [2]
Alameda-Pineda, Xavier [3]
Affiliations
[1] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Nanjing, Peoples R China
[2] IIT, Dept Comp Sci, Chicago, IL 60616 USA
[3] Univ Grenoble Alpes, LJK, Grenoble INP, INRIA, CNRS, F-38000 Grenoble, France
DOI
10.1109/CVPR52688.2022.00110
CLC classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Humans can easily recognize where and how a sound is produced by watching a scene and listening to the corresponding audio cues. To achieve such cross-modal perception on machines, existing methods use only the maps generated by interpolation operations to localize the sound source. As semantic object-level localization is more attractive for potential practical applications, we argue that these map-based approaches provide only a coarse-grained and indirect description of the sound source. In this paper, we advocate a novel proposal-based paradigm that can directly perform semantic object-level localization without any manual annotations. We incorporate the global response map as an unsupervised spatial constraint to weight the proposals according to how well they cover the estimated global shape of the sound source. As a result, our proposal-based sound source localization can be cast as a simpler Multiple Instance Learning (MIL) problem by filtering out the instances corresponding to large sound-unrelated regions. Our method achieves state-of-the-art (SOTA) performance compared with several baselines on multiple datasets.
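The abstract describes the proposal-weighting and MIL steps only at a high level. The following is a minimal sketch under assumptions that are ours and not the paper's. We assume a proposal's weight is its IoU with the global response map binarized at a fixed threshold. We also assume the MIL bag score is a weighted mean over the proposals whose weight clears a cutoff. The helper names (`binarize`, `box_mask_iou`, `proposal_weights`, `mil_bag_score`) and the parameters `thresh` and `keep` are hypothetical, as are the toy response map and similarity scores.

```python
from typing import List, Sequence, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in pixel coordinates, end-exclusive.
Box = Tuple[int, int, int, int]


def binarize(response: Sequence[Sequence[float]], thresh: float) -> List[List[bool]]:
    """Threshold the global response map into a coarse sound-source mask (assumed step)."""
    return [[v >= thresh for v in row] for row in response]


def box_mask_iou(box: Box, mask: List[List[bool]]) -> float:
    """IoU between a proposal box and the binary mask. A large box over
    sound-unrelated pixels gets a small score because the union grows."""
    x0, y0, x1, y1 = box
    inter = sum(mask[y][x] for y in range(y0, y1) for x in range(x0, x1))
    box_area = (x1 - x0) * (y1 - y0)
    mask_area = sum(sum(row) for row in mask)
    union = box_area + mask_area - inter
    return inter / union if union > 0 else 0.0


def proposal_weights(response: Sequence[Sequence[float]], boxes: Sequence[Box],
                     thresh: float = 0.5) -> List[float]:
    """Weight each proposal by how well it covers the estimated source shape."""
    mask = binarize(response, thresh)
    return [box_mask_iou(b, mask) for b in boxes]


def mil_bag_score(scores: Sequence[float], weights: Sequence[float],
                  keep: float = 0.1) -> float:
    """Drop instances whose weight is below `keep`, then return the
    weight-averaged audio-visual score of the remaining instances."""
    kept = [(s, w) for s, w in zip(scores, weights) if w >= keep]
    total_w = sum(w for _, w in kept)
    if total_w == 0:
        return 0.0
    return sum(s * w for s, w in kept) / total_w


if __name__ == "__main__":
    # Toy 4x4 response map with a compact "sounding object" in the center.
    response = [
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.9, 0.8, 0.0],
        [0.0, 0.7, 0.9, 0.0],
        [0.0, 0.0, 0.0, 0.0],
    ]
    boxes = [(1, 1, 3, 3), (0, 0, 4, 4), (0, 0, 1, 1)]  # tight, whole frame, off-target
    weights = proposal_weights(response, boxes)
    print(weights)  # [1.0, 0.25, 0.0]
    # Hypothetical per-proposal audio-visual similarity scores.
    print(round(mil_bag_score([0.8, 0.4, 0.9], weights), 3))  # 0.72
```

In this toy run, the whole-frame box is downweighted to 0.25 because most of it is sound-unrelated. The off-target box is filtered out entirely, so its high similarity score of 0.9 does not inflate the bag score. The paper's actual weighting and MIL formulation may differ from this IoU-and-weighted-mean choice.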
Pages: 1019-1028
Page count: 10