Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

Citations: 0
Authors
Korbar, Bruno [1]
Tran, Du [2]
Torresani, Lorenzo [1]
Affiliations
[1] Dartmouth College, Hanover, NH 03755, USA
[2] Facebook Research, Menlo Park, CA, USA
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
There is a natural correlation between the visual and auditive elements of a video. In this work we leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs. Without further finetuning, the resulting audio features achieve performance superior or comparable to the state-of-the-art on established audio classification benchmarks (DCASE2014 and ESC-50). At the same time, our visual subnet provides a very effective initialization to improve the accuracy of video-based action recognition models: compared to learning from scratch, our self-supervised pretraining yields a remarkable gain of +19.9% in action recognition accuracy on UCF101 and a boost of +17.7% on HMDB51.
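The abstract names a contrastive loss over audio-video pairs as a key ingredient but does not give its form. As a rough illustration only, the sketch below uses the standard margin-based contrastive loss on the Euclidean distance between a video embedding and an audio embedding. The function name, the `margin` value, and the plain-list embeddings are assumptions made for this example; they are not taken from the paper.

```python
import math


def contrastive_loss(video_emb, audio_emb, in_sync, margin=1.0):
    """Margin-based contrastive loss for a single audio-video pair.

    Hypothetical helper for illustration: in-sync (positive) pairs are
    pulled together, out-of-sync (negative) pairs are pushed apart until
    their distance exceeds `margin`.
    """
    # Euclidean distance between the two embeddings.
    dist = math.sqrt(sum((v - a) ** 2 for v, a in zip(video_emb, audio_emb)))
    if in_sync:
        # Positive pair: penalize any distance.
        return dist ** 2
    # Negative pair: penalize only when closer than the margin.
    return max(margin - dist, 0.0) ** 2
```

Under this formulation, a negative pair already farther apart than the margin contributes zero loss, so the choice of negatives determines how much training signal each batch carries.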
Pages: 12
Related papers
50 in total
  • [1] Audio self-supervised learning: A survey
    Liu, Shuo
    Mallol-Ragolta, Adria
    Parada-Cabaleiro, Emilia
    Qian, Kun
    Jing, Xin
    Kathan, Alexander
    Hu, Bin
    Schuller, Bjorn W.
    [J]. PATTERNS, 2022, 3 (12):
  • [2] VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
    Akbari, Hassan
    Yuan, Liangzhe
    Qian, Rui
    Chuang, Wei-Hong
    Chang, Shih-Fu
    Cui, Yin
    Gong, Boqing
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021,
  • [3] Self-Supervised Learning by Cross-Modal Audio-Video Clustering
    Alwassel, Humam
    Mahajan, Dhruv
    Korbar, Bruno
    Torresani, Lorenzo
    Ghanem, Bernard
    Tran, Du
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [4] Cascaded Siamese Self-supervised Audio to Video GAN
    Aldausari, Nuha
    Sowmya, Arcot
    Marcus, Nadine
    Mohammadi, Gelareh
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022, 2022, : 4690 - 4699
  • [5] Self-Supervised Generation of Spatial Audio for 360° Video
    Morgado, Pedro
    Vasconcelos, Nuno
    Langlois, Timothy
    Wang, Oliver
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
  • [6] Self-supervised learning of class embeddings from video
    Wiles, Olivia
    Koepke, A. Sophia
    Zisserman, Andrew
    [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2019, : 3019 - 3027
  • [7] The Efficacy of Self-Supervised Speech Models as Audio Representations
    Wu, Tung-Yu
    Hsu, Tsu-Yuan
    Li, Chen-An
    Lin, Tzu-Han
    Lee, Hung-yi
    [J]. HEAR: HOLISTIC EVALUATION OF AUDIO REPRESENTATIONS, VOL 166, 2021, 166 : 90 - 110
  • [8] Self-Supervised Learning of Audio Representations From Permutations With Differentiable Ranking
    Carr, Andrew N.
    Berthet, Quentin
    Blondel, Mathieu
    Teboul, Olivier
    Zeghidour, Neil
    [J]. IEEE SIGNAL PROCESSING LETTERS, 2021, 28 : 708 - 712
  • [9] WaveBYOL: Self-Supervised Learning for Audio Representation From Raw Waveforms
    Kim, Sunghyun
    Choi, Yong-Hoon
    [J]. IEEE ACCESS, 2023, 11 : 8968 - 8977
  • [10] Self-supervised Learning for Endoscopic Video Analysis
    Hirsch, Roy
    Caron, Mathilde
    Cohen, Regev
    Livne, Amir
    Shapiro, Ron
    Golany, Tomer
    Goldenberg, Roman
    Freedman, Daniel
    Rivlin, Ehud
    [J]. MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT V, 2023, 14224 : 569 - 578