Self-supervised object detection from audio-visual correspondence

Cited by: 12

Authors:

Afouras, Triantafyllos [1,4]

Asano, Yuki M. [2]

Fagan, Francois [3]

Vedaldi, Andrea [3]

Metze, Florian [3]

Affiliations:

[1] Univ Oxford, Oxford, England

[2] Univ Amsterdam, Amsterdam, Netherlands

[3] Meta AI, Menlo Pk, CA USA

[4] FAIR, Oxford, England

Source:

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2022

Keywords:

DOI:

10.1109/CVPR52688.2022.01032

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We tackle the problem of learning object detectors without supervision. Unlike weakly-supervised object detection, we do not assume image-level class labels. Instead, we extract a supervisory signal from audio-visual data, using the audio component to "teach" the object detector. While this problem is related to sound source localisation, it is considerably harder because the detector must classify the objects by type, enumerate each instance of the object, and do so even when the object is silent. We tackle this problem by first designing a self-supervised framework with a contrastive objective that jointly learns to classify and localise objects. Then, without using any supervision, we simply use these self-supervised labels and boxes to train an image-based object detector. With this, we outperform previous unsupervised and weakly-supervised detectors for the tasks of object detection and sound source localisation. We also show that we can align this detector to ground-truth classes with as little as one label per pseudo-class, and show how our method can learn to detect generic objects that go beyond instruments, such as airplanes and cats.

Citation

Pages: 10565-10576

Page count: 12

50 items in total

[41] Robust Audio-Visual Contrastive Learning for Proposal-Based Self-Supervised Sound Source Localization in Videos
Xuan, Hanyu
Wu, Zhiliang
Yang, Jian
Jiang, Bo
Luo, Lei
Alameda-Pineda, Xavier
Yan, Yan
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (07) : 4896 - 4907
[42] Road Condition Anomaly Detection using Self-Supervised Learning from Audio
Gim, U-Ju
2023 IEEE 26TH INTERNATIONAL CONFERENCE ON INTELLIGENT TRANSPORTATION SYSTEMS, ITSC, 2023, : 675 - 680
[43] HASSOD: Hierarchical Adaptive Self-Supervised Object Detection
Cao, Shengcao
Joshi, Dhiraj
Gui, Liang-Yan
Wang, Yu-Xiong
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
[44] Self-Supervised Reinforcement Learning for Active Object Detection
Fang, Fen
Liang, Wenyu
Wu, Yan
Xu, Qianli
Lim, Joo-Hwee
IEEE ROBOTICS AND AUTOMATION LETTERS, 2022, 7 (04) : 10224 - 10231
[45] Egocentric Audio-Visual Object Localization
Huang, Chao
Tian, Yapeng
Kumar, Anurag
Xu, Chenliang
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 22910 - 22921
[46] Incorporating Visual Information in Audio Based Self-Supervised Speaker Recognition
Cai, Danwei
Wang, Weiqing
Li, Ming
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1422 - 1435
[47] Weakly-Supervised Audio-Visual Segmentation
Mo, Shentong
Raj, Bhiksha
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
[48] Audio self-supervised learning: A survey
Liu, Shuo
Mallol-Ragolta, Adria
Parada-Cabaleiro, Emilia
Qian, Kun
Jing, Xin
Kathan, Alexander
Hu, Bin
Schuller, Bjorn W.
PATTERNS, 2022, 3 (12):
[49] An Audio-Visual System for Object-Based Audio: From Recording to Listening
Coleman, Philip
Franck, Andreas
Francombe, Jon
Liu, Qingju
de Campos, Teofilo
Hughes, Richard J.
Menzies, Dylan
Galvez, Marcos F. Simon
Tang, Yan
Woodcock, James
Jackson, Philip J. B.
Melchior, Frank
Pike, Chris
Fazi, Filippo Maria
Cox, Trevor J.
Hilton, Adrian
IEEE TRANSACTIONS ON MULTIMEDIA, 2018, 20 (08) : 1919 - 1931
[50] Audio-visual event detection based on mining of semantic audio-visual labels
Goh, KS
Miyahara, K
Radhakrishnan, R
Xiong, ZY
Divakaran, A
STORAGE AND RETRIEVAL METHODS AND APPLICATIONS FOR MULTIMEDIA 2004, 2004, 5307 : 292 - 299
