Audio-Visual Predictive Coding for Self-Supervised Visual Representation Learning

被引:0
|
作者
Tellamekala, Mani Kumar [1 ]
Valstar, Michel [1 ]
Pound, Michael [1 ]
Giesbrecht, Timo [2 ]
机构
[1] Univ Nottingham, Sch Comp Sci, Comp Vis Lab, Nottingham, England
[2] Unilever R&D Port Sunlight, Bebington, England
基金
英国工程与自然科学研究理事会;
关键词
D O I
10.1109/ICPR48806.2021.9413295
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Self-supervised learning has emerged as a candidate approach to learn semantic visual features from unlabeled video data. In self-supervised learning, intrinsic correspondences between data points are used to define a proxy task that forces the model to learn semantic representations. Most existing proxy tasks applied to video data exploit only either intra-modal (e.g. temporal) or cross-modal (e.g. audio-visual) correspondences separately. In theory, jointly learning both these correspondences may result in richer visual features; but, as we show in this work, doing so is non-trivial in practice. To address this problem, we introduce 'Audio-Visual Permutative Predictive Coding' (AV-PPC), a multi-task learning framework designed to fully leverage the temporal and cross-modal correspondences as natural supervision signals. In AV-PPC, the model is trained to simultaneously learn multiple intra- and cross-modal predictive coding sub-tasks. By using visual speech recognition (lip-reading) as the downstream evaluation task, we show that our proposed proxy task can learn higher quality visual features than existing proxy tasks. We also show that AV-PPC visual features are highly data-efficient. Without further finetuning, AV-PPC visual encoder achieves 8030% spoken word classification rate on the LRW dataset, performing on par with directly supervised visual encoders that are learned from large amounts of labeled data.
引用
收藏
页码:9912 / 9919
页数:8
相关论文
共 50 条
  • [1] Comparing Learning Methodologies for Self-Supervised Audio-Visual Representation Learning
    Terbouche, Hacene
    Schoneveld, Liam
    Benson, Oisin
    Othmani, Alice
    IEEE ACCESS, 2022, 10 : 41622 - 41638
  • [2] Self-Supervised Audio-Visual Representation Learning for in-the-wild Videos
    Feng, Zishun
    Tu, Ming
    Xia, Rui
    Wang, Yuxuan
    Krishnamurthy, Ashok
    2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2020, : 5671 - 5672
  • [3] SELF-SUPERVISED LEARNING FOR AUDIO-VISUAL SPEAKER DIARIZATION
    Ding, Yifan
    Xu, Yong
    Zhang, Shi-Xiong
    Cong, Yahuan
    Wang, Liqiang
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 4367 - 4371
  • [4] Enhancing Audio-Visual Association with Self-Supervised Curriculum Learning
    Zhang, Jingran
    Xu, Xing
    Shen, Fumin
    Lu, Huimin
    Lu, Xin
    Shen, Heng Tao
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 3351 - 3359
  • [5] SELF-SUPERVISED CONTRASTIVE LEARNING FOR AUDIO-VISUAL ACTION RECOGNITION
    Liu, Yang
    Tan, Ying
    Lan, Haoyuan
    2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 1000 - 1004
  • [6] Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity
    Sarkar, Pritam
    Etemad, Ali
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 8, 2023, : 9723 - 9732
  • [7] Learning Self-supervised Audio-Visual Representations for Sound Recommendations
    Krishnamurthy, Sudha
    ADVANCES IN VISUAL COMPUTING (ISVC 2021), PT II, 2021, 13018 : 124 - 138
  • [8] Self-Supervised Learning for Audio-Visual Relationships of Videos With Stereo Sounds
    Sato, Tomoya
    Sugano, Yusuke
    Sato, Yoichi
    IEEE ACCESS, 2022, 10 : 94273 - 94284
  • [9] SELF-SUPERVISED AUDIO-VISUAL CO-SEGMENTATION
    Rouditchenko, Andrew
    Zhao, Hang
    Gan, Chuang
    McDermott, Josh
    Torralba, Antonio
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 2357 - 2361
  • [10] Robust Self-Supervised Audio-Visual Speech Recognition
    Shi, Bowen
    Hsu, Wei-Ning
    Mohamed, Abdelrahman
    INTERSPEECH 2022, 2022, : 2118 - 2122