Self-supervised object detection from audio-visual correspondence

Cited by: 12

Authors:

Afouras, Triantafyllos [1, 4]

Asano, Yuki M. [2]

Fagan, Francois [3]

Vedaldi, Andrea [3]

Metze, Florian [3]

Affiliations:

[1] Univ Oxford, Oxford, England

[2] Univ Amsterdam, Amsterdam, Netherlands

[3] Meta AI, Menlo Pk, CA USA

[4] FAIR, Oxford, England

Source:

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2022

Keywords:

DOI:

10.1109/CVPR52688.2022.01032

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We tackle the problem of learning object detectors without supervision. Differently from weakly-supervised object detection, we do not assume image-level class labels. Instead, we extract a supervisory signal from audio-visual data, using the audio component to "teach" the object detector. While this problem is related to sound source localisation, it is considerably harder because the detector must classify the objects by type, enumerate each instance of the object, and do so even when the object is silent. We tackle this problem by first designing a self-supervised framework with a contrastive objective that jointly learns to classify and localise objects. Then, without using any supervision, we simply use these self-supervised labels and boxes to train an image-based object detector. With this, we outperform previous unsupervised and weakly-supervised detectors for the task of object detection and sound source localization. We also show that we can align this detector to ground-truth classes with as little as one label per pseudo-class, and show how our method can learn to detect generic objects that go beyond instruments, such as airplanes and cats.
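The abstract mentions a contrastive objective over audio-visual data but does not spell it out. As a rough illustration only, the sketch below shows a generic symmetric InfoNCE loss between paired audio and visual embeddings. This is a common formulation, not necessarily the one used in the paper; the function name `info_nce`, the `temperature` value, and the embedding shapes are all assumptions made for this example.

```python
import numpy as np

def info_nce(audio, visual, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings (hypothetical sketch).

    audio, visual: arrays of shape (batch, dim); row i of each is assumed
    to come from the same video clip (the positive pair).
    """
    # L2-normalise so the dot product is a cosine similarity.
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    logits = a @ v.T / temperature  # (batch, batch) similarity matrix

    def diag_cross_entropy(l):
        # Numerically stable log-softmax over each row; the target for
        # row i is column i (the matching clip).
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the audio-to-visual and visual-to-audio directions.
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))

# Usage: correctly paired embeddings should score a lower loss than
# deliberately mismatched ones.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
loss_matched = info_nce(emb, emb)
loss_mismatched = info_nce(emb, np.roll(emb, 1, axis=0))
```

The loss rewards a high similarity between each audio clip and its own frame relative to the other frames in the batch, which is the kind of correspondence signal the abstract describes the audio as providing.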

Citation

Pages: 10565-10576

Page count: 12

50 items in total

[1] Self-Supervised Moving Vehicle Detection From Audio-Visual Cues
Zuern, Jannik
Burgard, Wolfram
IEEE ROBOTICS AND AUTOMATION LETTERS, 2022, 7 (03) : 7415 - 7422
[2] Self-Supervised Video Forensics by Audio-Visual Anomaly Detection
Feng, Chao
Chen, Ziyang
Owens, Andrew
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 10491 - 10503
[3] SELF-SUPERVISED AUDIO-VISUAL CO-SEGMENTATION
Rouditchenko, Andrew
Zhao, Hang
Gan, Chuang
McDermott, Josh
Torralba, Antonio
2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 2357 - 2361
[4] Robust Self-Supervised Audio-Visual Speech Recognition
Shi, Bowen
Hsu, Wei-Ning
Mohamed, Abdelrahman
INTERSPEECH 2022, 2022, : 2118 - 2122
[5] SELF-SUPERVISED LEARNING FOR AUDIO-VISUAL SPEAKER DIARIZATION
Ding, Yifan
Xu, Yong
Zhang, Shi-Xiong
Cong, Yahuan
Wang, Liqiang
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 4367 - 4371
[6] Audio-Visual Predictive Coding for Self-Supervised Visual Representation Learning
Tellamekala, Mani Kumar
Valstar, Michel
Pound, Michael
Giesbrecht, Timo
2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 9912 - 9919
[7] SELF-SUPERVISED CONTRASTIVE LEARNING FOR AUDIO-VISUAL ACTION RECOGNITION
Liu, Yang
Tan, Ying
Lan, Haoyuan
2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 1000 - 1004
[8] Enhancing Audio-Visual Association with Self-Supervised Curriculum Learning
Zhang, Jingran
Xu, Xing
Shen, Fumin
Lu, Huimin
Lu, Xin
Shen, Heng Tao
THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 3351 - 3359
[9] Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
Owens, Andrew
Efros, Alexei A.
COMPUTER VISION - ECCV 2018, PT VI, 2018, 11210 : 639 - 658
[10] Learning Self-supervised Audio-Visual Representations for Sound Recommendations
Krishnamurthy, Sudha
ADVANCES IN VISUAL COMPUTING (ISVC 2021), PT II, 2021, 13018 : 124 - 138
