Self-supervised object detection from audio-visual correspondence

Cited by: 12
Authors
Afouras, Triantafyllos [1, 4]
Asano, Yuki M. [2]
Fagan, Francois [3]
Vedaldi, Andrea [3]
Metze, Florian [3]
Affiliations
[1] Univ Oxford, Oxford, England
[2] Univ Amsterdam, Amsterdam, Netherlands
[3] Meta AI, Menlo Pk, CA USA
[4] FAIR, Oxford, England
DOI
10.1109/CVPR52688.2022.01032
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We tackle the problem of learning object detectors without supervision. Differently from weakly-supervised object detection, we do not assume image-level class labels. Instead, we extract a supervisory signal from audio-visual data, using the audio component to "teach" the object detector. While this problem is related to sound source localisation, it is considerably harder because the detector must classify the objects by type, enumerate each instance of the object, and do so even when the object is silent. We tackle this problem by first designing a self-supervised framework with a contrastive objective that jointly learns to classify and localise objects. Then, without using any supervision, we simply use these self-supervised labels and boxes to train an image-based object detector. With this, we outperform previous unsupervised and weakly-supervised detectors for the task of object detection and sound source localization. We also show that we can align this detector to ground-truth classes with as little as one label per pseudo-class, and show how our method can learn to detect generic objects that go beyond instruments, such as airplanes and cats.
Pages: 10565 - 10576
Number of pages: 12
Related papers
50 in total
  • [41] Robust Audio-Visual Contrastive Learning for Proposal-Based Self-Supervised Sound Source Localization in Videos
    Xuan, Hanyu
    Wu, Zhiliang
    Yang, Jian
    Jiang, Bo
    Luo, Lei
    Alameda-Pineda, Xavier
    Yan, Yan
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (07) : 4896 - 4907
  • [42] Road Condition Anomaly Detection using Self-Supervised Learning from Audio
    Gim, U-Ju
    2023 IEEE 26TH INTERNATIONAL CONFERENCE ON INTELLIGENT TRANSPORTATION SYSTEMS, ITSC, 2023, : 675 - 680
  • [43] HASSOD: Hierarchical Adaptive Self-Supervised Object Detection
    Cao, Shengcao
    Joshi, Dhiraj
    Gui, Liang-Yan
    Wang, Yu-Xiong
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [44] Self-Supervised Reinforcement Learning for Active Object Detection
    Fang, Fen
    Liang, Wenyu
    Wu, Yan
    Xu, Qianli
    Lim, Joo-Hwee
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2022, 7 (04) : 10224 - 10231
  • [45] Egocentric Audio-Visual Object Localization
    Huang, Chao
    Tian, Yapeng
    Kumar, Anurag
    Xu, Chenliang
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 22910 - 22921
  • [46] Incorporating Visual Information in Audio Based Self-Supervised Speaker Recognition
    Cai, Danwei
    Wang, Weiqing
    Li, Ming
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1422 - 1435
  • [47] Weakly-Supervised Audio-Visual Segmentation
    Mo, Shentong
    Raj, Bhiksha
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [48] Audio self-supervised learning: A survey
    Liu, Shuo
    Mallol-Ragolta, Adria
    Parada-Cabaleiro, Emilia
    Qian, Kun
    Jing, Xin
    Kathan, Alexander
    Hu, Bin
    Schuller, Bjorn W.
    PATTERNS, 2022, 3 (12):
  • [49] An Audio-Visual System for Object-Based Audio: From Recording to Listening
    Coleman, Philip
    Franck, Andreas
    Francombe, Jon
    Liu, Qingju
    de Campos, Teofilo
    Hughes, Richard J.
    Menzies, Dylan
    Galvez, Marcos F. Simon
    Tang, Yan
    Woodcock, James
    Jackson, Philip J. B.
    Melchior, Frank
    Pike, Chris
    Fazi, Filippo Maria
    Cox, Trevor J.
    Hilton, Adrian
    IEEE TRANSACTIONS ON MULTIMEDIA, 2018, 20 (08) : 1919 - 1931
  • [50] Audio-visual event detection based on mining of semantic audio-visual labels
    Goh, KS
    Miyahara, K
    Radhakrishnan, R
    Xiong, ZY
    Divakaran, A
    STORAGE AND RETRIEVAL METHODS AND APPLICATIONS FOR MULTIMEDIA 2004, 2004, 5307 : 292 - 299