Self-supervised object detection from audio-visual correspondence

Cited by: 12
Authors
Afouras, Triantafyllos [1, 4]
Asano, Yuki M. [2]
Fagan, Francois [3]
Vedaldi, Andrea [3]
Metze, Florian [3]
Affiliations
[1] Univ Oxford, Oxford, England
[2] Univ Amsterdam, Amsterdam, Netherlands
[3] Meta AI, Menlo Pk, CA USA
[4] FAIR, Oxford, England
DOI
10.1109/CVPR52688.2022.01032
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We tackle the problem of learning object detectors without supervision. Unlike weakly-supervised object detection, we do not assume image-level class labels. Instead, we extract a supervisory signal from audio-visual data, using the audio component to "teach" the object detector. While this problem is related to sound source localisation, it is considerably harder because the detector must classify the objects by type, enumerate each instance of the object, and do so even when the object is silent. We tackle this problem by first designing a self-supervised framework with a contrastive objective that jointly learns to classify and localise objects. Then, without using any supervision, we simply use these self-supervised labels and boxes to train an image-based object detector. With this, we outperform previous unsupervised and weakly-supervised detectors on the tasks of object detection and sound source localisation. We also show that we can align this detector to ground-truth classes with as little as one label per pseudo-class, and that our method can learn to detect generic objects beyond instruments, such as airplanes and cats.
Pages: 10565 - 10576
Page count: 12
Related papers
50 in total
  • [31] Self-Supervised Visual Descriptor Learning for Dense Correspondence
    Schmidt, Tanner
    Newcombe, Richard
    Fox, Dieter
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2017, 2 (02): 420 - 427
  • [32] Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking
    Li, Yidi
    Liu, Hong
    Tang, Hao
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022: 1456 - 1463
  • [33] Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization
    Liu, Tianyu
    Zhang, Peng
    Huang, Wei
    Zha, Yufei
    You, Tao
    Zhang, Yanning
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023: 4042 - 4052
  • [34] Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning
    Cheng, Ying
    Wang, Ruize
    Pan, Zhihao
    Feng, Rui
    Zhang, Yuejie
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020: 3884 - 3892
  • [35] Self-Supervised Audio-Visual Feature Learning for Single-Modal Incremental Terrain Type Clustering
    Ishikawa, Reina
    Hachiuma, Ryo
    Saito, Hideo
    IEEE ACCESS, 2021, 9 : 64346 - 64357
  • [36] Object category detection using audio-visual cues
    Luo, Jie
    Caputo, Barbara
    Zweig, Alon
    Bach, Joerg-Hendrik
    Anemueller, Joern
    COMPUTER VISION SYSTEMS, PROCEEDINGS, 2008, 5008 : 539 - 548
  • [37] Temporal structure and complexity affect audio-visual correspondence detection
    Denison, Rachel N.
    Driver, Jon
    Ruff, Christian C.
    FRONTIERS IN PSYCHOLOGY, 2013, 3
  • [38] Object Detection with Self-Supervised Scene Adaptation
    Zhang, Zekun
    Hoai, Minh
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023: 21589 - 21599
  • [39] DEPA: Self-Supervised Audio Embedding for Depression Detection
    Zhang, Pingyue
    Wu, Mengyue
    Dinkel, Heinrich
    Yu, Kai
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021: 135 - 143
  • [40] Audio-Visual Weakly Supervised Approach for Apathy Detection in the Elderly
    Sharma, Garima
    Joshi, Jyoti
    Zeghari, Radia
    Guerchouche, Rachid
    2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020