Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

Authors
Korbar, Bruno [1]
Tran, Du [2]
Torresani, Lorenzo [1]
Affiliations
[1] Dartmouth College, Hanover, NH 03755, USA
[2] Facebook Research, Menlo Park, CA, USA
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
There is a natural correlation between the visual and auditive elements of a video. In this work we leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs. Without further finetuning, the resulting audio features achieve performance superior or comparable to the state-of-the-art on established audio classification benchmarks (DCASE2014 and ESC-50). At the same time, our visual subnet provides a very effective initialization to improve the accuracy of video-based action recognition models: compared to learning from scratch, our self-supervised pretraining yields a remarkable gain of +19.9% in action recognition accuracy on UCF101 and a boost of +17.7% on HMDB51.
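As a rough illustration of the contrastive loss mentioned in the abstract, the sketch below scores audio-video embedding pairs as in sync or out of sync. The abstract does not give the exact formulation, so this assumes the standard margin-based contrastive loss over Euclidean distance; the function names, the `margin` parameter, and the toy embeddings are hypothetical and not taken from the paper.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(video_emb, audio_emb, in_sync, margin=1.0):
    """Margin-based contrastive loss for one audio-video pair (assumed form).

    In-sync (positive) pairs are penalized by their squared distance, which
    pulls them together. Out-of-sync (negative) pairs are penalized only
    when they are closer than `margin`, which pushes them apart.
    """
    d = euclidean_distance(video_emb, audio_emb)
    if in_sync:
        return d ** 2
    return max(margin - d, 0.0) ** 2


def batch_contrastive_loss(pairs, margin=1.0):
    """Mean loss over an iterable of (video_emb, audio_emb, in_sync) triples."""
    pairs = list(pairs)
    total = sum(contrastive_loss(v, a, y, margin) for v, a, y in pairs)
    return total / len(pairs)
```

In this form, the choice of negative examples that the abstract highlights corresponds to which out-of-sync pairs are passed with `in_sync=False`, for example audio from a different video versus time-shifted audio from the same video.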
Pages: 12