Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

Cited by: 0

Authors:

Korbar, Bruno [1]

Tran, Du [2]

Torresani, Lorenzo [1]

Affiliations:

[1] Dartmouth Coll, Hanover, NH 03755 USA

[2] Facebook Res, Menlo Pk, CA USA

Source:

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018) | 2018 / Vol. 31

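Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes: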

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

There is a natural correlation between the visual and auditive elements of a video. In this work we leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs. Without further finetuning, the resulting audio features achieve performance superior or comparable to the state-of-the-art on established audio classification benchmarks (DCASE2014 and ESC-50). At the same time, our visual subnet provides a very effective initialization to improve the accuracy of video-based action recognition models: compared to learning from scratch, our self-supervised pretraining yields a remarkable gain of +19.9% in action recognition accuracy on UCF101 and a boost of +17.7% on HMDB51.
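
The abstract names a contrastive loss over audio-video pairs but does not give its formula. As a rough illustration only, the sketch below uses the standard margin-based contrastive loss on the Euclidean distance between a video embedding and an audio embedding: synchronized pairs are pulled together, out-of-sync pairs are pushed at least a margin apart. The function names, the `margin` default, and the toy embeddings are assumptions made for this example and are not taken from the paper.

```python
import math

def contrastive_loss(video_emb, audio_emb, in_sync, margin=1.0):
    """Margin-based contrastive loss for one audio-video pair.

    video_emb, audio_emb: equal-length sequences of floats (hypothetical embeddings).
    in_sync: True if the audio and video are temporally aligned.
    margin: assumed hyperparameter; the record does not state a value.
    """
    d = math.dist(video_emb, audio_emb)  # Euclidean distance between embeddings
    if in_sync:
        # Positive pair: penalize any distance between the two embeddings.
        return d ** 2
    # Negative pair: penalize only when the embeddings are closer than the margin.
    return max(margin - d, 0.0) ** 2

def batch_loss(pairs, margin=1.0):
    """Average the per-pair loss over a list of (video_emb, audio_emb, in_sync)."""
    return sum(contrastive_loss(v, a, y, margin) for v, a, y in pairs) / len(pairs)

# Toy usage with made-up 2-D embeddings.
pairs = [
    ([0.0, 0.0], [3.0, 4.0], True),   # in-sync pair that is far apart: large loss
    ([0.0, 0.0], [3.0, 4.0], False),  # out-of-sync pair beyond the margin: zero loss
]
print(batch_loss(pairs))
```

In this sketch, a choice of "harder" negatives (for example, audio taken from the same video at a shifted time) would only change which pairs are passed in with `in_sync=False`; the loss itself stays the same.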

Pages: 12

