Self-supervised object detection from audio-visual correspondence

Cited by: 12
Authors
Afouras, Triantafyllos [1, 4]
Asano, Yuki M. [2]
Fagan, Francois [3]
Vedaldi, Andrea [3]
Metze, Florian [3]
Affiliations
[1] Univ Oxford, Oxford, England
[2] Univ Amsterdam, Amsterdam, Netherlands
[3] Meta AI, Menlo Park, CA, USA
[4] FAIR, Oxford, England
DOI
10.1109/CVPR52688.2022.01032
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We tackle the problem of learning object detectors without supervision. Unlike weakly-supervised object detection, we do not assume image-level class labels. Instead, we extract a supervisory signal from audio-visual data, using the audio component to "teach" the object detector. While this problem is related to sound source localisation, it is considerably harder because the detector must classify the objects by type, enumerate each instance of the object, and do so even when the object is silent. We tackle this problem by first designing a self-supervised framework with a contrastive objective that jointly learns to classify and localise objects. Then, without using any supervision, we use these self-supervised labels and boxes to train an image-based object detector. With this, we outperform previous unsupervised and weakly-supervised detectors on the tasks of object detection and sound source localisation. We also show that we can align this detector to ground-truth classes with as little as one label per pseudo-class, and that our method can learn to detect generic objects beyond instruments, such as airplanes and cats.
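As a rough illustration of what a contrastive audio-visual objective can look like, the sketch below computes a generic InfoNCE-style loss over a small batch of paired embeddings. It is not the paper's actual loss: the function names (`cosine`, `info_nce`), the temperature value, and the toy two-dimensional embeddings are all assumptions made for this example. The only idea taken from the abstract is that the visual and audio features of the same clip should agree more than features from different clips.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(visual, audio, temperature=0.1):
    """Mean InfoNCE loss; visual[i] and audio[i] are assumed to come from the same clip.

    Hypothetical helper for illustration only, not the authors' implementation.
    """
    total = 0.0
    for i, v in enumerate(visual):
        # Similarity of this visual embedding to every audio embedding in the batch.
        logits = [cosine(v, a) / temperature for a in audio]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the matching audio clip.
        total += log_norm - logits[i]
    return total / len(visual)


# Toy embeddings (assumed values): two clips, two dimensions.
visual = [[1.0, 0.0], [0.0, 1.0]]
matched = info_nce(visual, [[0.9, 0.1], [0.1, 0.9]])  # audio aligned with visuals
swapped = info_nce(visual, [[0.1, 0.9], [0.9, 0.1]])  # audio paired with the wrong clip
```

In this toy setup the loss is lower when each audio embedding is paired with its own clip than when the pairs are swapped, which is the signal a contrastive objective uses to pull corresponding audio and visual features together.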
Pages: 10565-10576
Page count: 12