Self-supervised object detection from audio-visual correspondence

Citations: 12
Authors
Afouras, Triantafyllos [1,4]
Asano, Yuki M. [2]
Fagan, Francois [3]
Vedaldi, Andrea [3]
Metze, Florian [3]
Affiliations
[1] Univ Oxford, Oxford, England
[2] Univ Amsterdam, Amsterdam, Netherlands
[3] Meta AI, Menlo Pk, CA USA
[4] FAIR, Oxford, England
DOI
10.1109/CVPR52688.2022.01032
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405
Abstract
We tackle the problem of learning object detectors without supervision. Unlike weakly-supervised object detection, we do not assume image-level class labels. Instead, we extract a supervisory signal from audio-visual data, using the audio component to "teach" the object detector. While this problem is related to sound source localisation, it is considerably harder because the detector must classify objects by type, enumerate every instance, and do so even when an object is silent. We address this by first designing a self-supervised framework with a contrastive objective that jointly learns to classify and localise objects. Then, without using any supervision, we simply use these self-supervised labels and boxes to train an image-based object detector. With this, we outperform previous unsupervised and weakly-supervised detectors on object detection and sound source localisation. We also show that the detector can be aligned to ground-truth classes with as little as one label per pseudo-class, and that our method can learn to detect generic objects beyond instruments, such as airplanes and cats.
Pages: 10565-10576
Page count: 12
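The abstract describes a two-stage pipeline: a contrastive audio-visual objective first produces self-supervised class labels and boxes, which are then used to train an ordinary image-based detector. As a rough illustration of the kind of objective the first stage relies on, the sketch below implements a generic symmetric InfoNCE loss between paired audio and visual embeddings; the function name, tensor shapes, and temperature are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def audio_visual_nce(visual_emb: torch.Tensor,
                     audio_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired clips.

    visual_emb, audio_emb: (B, D) embeddings where row i of both
    tensors comes from the same clip. Hypothetical interface; the
    paper's actual heads, shapes, and temperature may differ.
    """
    v = F.normalize(visual_emb, dim=1)
    a = F.normalize(audio_emb, dim=1)
    logits = v @ a.t() / temperature  # (B, B) scaled cosine similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Row i's positive is column i (the clip's own audio); every other
    # batch entry acts as a negative, in both matching directions.
    loss_v2a = F.cross_entropy(logits, targets)
    loss_a2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2a + loss_a2v)
```

Per the abstract, the labels and boxes produced by this stage then supervise a standard image-based detector with no further annotation, which is what lets the final model fire even on silent objects.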
Related Papers
50 records in total
  • [21] HiCMAE: Hierarchical Contrastive Masked Autoencoder for self-supervised Audio-Visual Emotion Recognition
    Sun, Licai
    Lian, Zheng
    Liu, Bin
    Tao, Jianhua
    INFORMATION FUSION, 2024, 108
  • [22] AV-PedAware: Self-Supervised Audio-Visual Fusion for Dynamic Pedestrian Awareness
    Yang, Yizhuo
    Yuan, Shenghai
    Cao, Muqing
    Yang, Jianfei
    Xie, Lihua
    2023 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS, IROS, 2023, : 1871 - 1877
  • [23] Single-modal Incremental Terrain Clustering from Self-Supervised Audio-Visual Feature Learning
    Ishikawa, Reina
    Hachiuma, Ryo
    Kurobe, Akiyoshi
    Saito, Hideo
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 9399 - 9406
  • [24] Self-Supervised Video Representation and Temporally Adaptive Attention for Audio-Visual Event Localization
    Ran, Yue
    Tang, Hongying
    Li, Baoqing
    Wang, Guohui
    APPLIED SCIENCES-BASEL, 2022, 12 (24)
  • [25] Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity
    Sarkar, Pritam
    Etemad, Ali
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 8, 2023, : 9723 - 9732
  • [26] Weakly Supervised Audio-Visual Violence Detection
    Wu, Peng
    Liu, Xiaotao
    Liu, Jing
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 1674 - 1685
  • [27] Self-Supervised Object Detection from Egocentric Videos
    Akiva, Peri
    Huang, Jing
    Liang, Kevin J.
    Kovvuri, Rama
    Chen, Xingyu
    Feiszli, Matt
    Dana, Kristin
    Hassner, Tal
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 5202 - 5214
  • [28] AUDIO-VISUAL SPEECH ENHANCEMENT AND SEPARATION BY UTILIZING MULTI-MODAL SELF-SUPERVISED EMBEDDINGS
    Chern, I-Chun
    Hung, Kuo-Hsuan
    Chen, Yi-Ting
    Hussain, Tassadaq
    Gogate, Mandar
    Hussain, Amir
    Tsao, Yu
    Hou, Jen-Cheng
    2023 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW, 2023
  • [29] Self-supervised Neural Audio-Visual Sound Source Localization via Probabilistic Spatial Modeling
    Masuyama, Yoshiki
    Bando, Yoshiaki
    Yatabe, Kohei
    Sasaki, Yoko
    Onishi, Masaki
    Oikawa, Yasuhiro
    2020 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2020, : 4848 - 4854
  • [30] Self-supervised Spoofing Audio Detection Scheme
    Jiang, Ziyue
    Zhu, Hongcheng
    Peng, Li
    Ding, Wenbing
    Ren, Yanzhen
    INTERSPEECH 2020, 2020, : 4223 - 4227