Audio-Visual Predictive Coding for Self-Supervised Visual Representation Learning

Cited by: 0

Authors
Tellamekala, Mani Kumar [1 ]
Valstar, Michel [1 ]
Pound, Michael [1 ]
Giesbrecht, Timo [2 ]
Affiliations
[1] Univ Nottingham, Sch Comp Sci, Comp Vis Lab, Nottingham, England
[2] Unilever R&D Port Sunlight, Bebington, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
DOI
10.1109/ICPR48806.2021.9413295
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Self-supervised learning has emerged as a candidate approach to learn semantic visual features from unlabeled video data. In self-supervised learning, intrinsic correspondences between data points are used to define a proxy task that forces the model to learn semantic representations. Most existing proxy tasks applied to video data exploit either intra-modal (e.g. temporal) or cross-modal (e.g. audio-visual) correspondences, but not both. In theory, jointly learning both correspondences may yield richer visual features; but, as we show in this work, doing so is non-trivial in practice. To address this problem, we introduce 'Audio-Visual Permutative Predictive Coding' (AV-PPC), a multi-task learning framework designed to fully leverage temporal and cross-modal correspondences as natural supervision signals. In AV-PPC, the model is trained to simultaneously solve multiple intra- and cross-modal predictive coding sub-tasks. Using visual speech recognition (lip-reading) as the downstream evaluation task, we show that our proposed proxy task learns higher-quality visual features than existing proxy tasks. We also show that AV-PPC visual features are highly data-efficient. Without further finetuning, the AV-PPC visual encoder achieves an 80.30% spoken word classification rate on the LRW dataset, performing on par with directly supervised visual encoders learned from large amounts of labeled data.
Pages: 9912 - 9919 (8 pages)
Related Papers (50 total)
  • [41] Robust Audio-Visual Contrastive Learning for Proposal-Based Self-Supervised Sound Source Localization in Videos
    Xuan, Hanyu
    Wu, Zhiliang
    Yang, Jian
    Jiang, Bo
    Luo, Lei
    Alameda-Pineda, Xavier
    Yan, Yan
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (07) : 4896 - 4907
  • [42] Enhancing motion visual cues for self-supervised video representation learning
    Nie, Mu
    Quan, Zhibin
    Ding, Weiping
    Yang, Wankou
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2023, 123
  • [43] Enhancing semantic audio-visual representation learning with supervised multi-scale attention
    Zhang, Jiwei
    Yu, Yi
    Tang, Suhua
    Qi, GuoJun
    Wu, Haiyuan
    Hachiya, Hirotaka
    PATTERN ANALYSIS AND APPLICATIONS, 2025, 28 (2)
  • [44] Can Semantic Labels Assist Self-Supervised Visual Representation Learning?
    Wei, Longhui
    Xie, Lingxi
    He, Jianzhong
    Zhang, Xiaopeng
    Tian, Qi
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 2642 - 2650
  • [45] MULTI-AUGMENTATION FOR EFFICIENT SELF-SUPERVISED VISUAL REPRESENTATION LEARNING
    Tran, Van Nhiem
    Huang, Chi-En
    Liu, Shen-Hsuan
    Yang, Kai-Lin
    Ko, Timothy
    Li, Yung-Hui
    2022 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO WORKSHOPS (IEEE ICMEW 2022), 2022,
  • [46] Boost Supervised Pretraining for Visual Transfer Learning: Implications of Self-Supervised Contrastive Representation Learning
    Sun, Jinghan
    Wei, Dong
    Ma, Kai
    Wang, Liansheng
    Zheng, Yefeng
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 2307 - 2315
  • [47] Audio-guided self-supervised learning for disentangled visual speech representations
    Feng, Dalu
    Yang, Shuang
    Shan, Shiguang
    Chen, Xilin
    FRONTIERS OF COMPUTER SCIENCE, 2024, 18 (06)
  • [49] AUDIO-VISUAL SPEECH ENHANCEMENT AND SEPARATION BY UTILIZING MULTI-MODAL SELF-SUPERVISED EMBEDDINGS
    Chern, I-Chun
    Hung, Kuo-Hsuan
    Chen, Yi-Ting
    Hussain, Tassadaq
    Gogate, Mandar
    Hussain, Amir
    Tsao, Yu
    Hou, Jen-Cheng
    2023 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW, 2023,
  • [50] Self-supervised Neural Audio-Visual Sound Source Localization via Probabilistic Spatial Modeling
    Masuyama, Yoshiki
    Bando, Yoshiaki
    Yatabe, Kohei
    Sasaki, Yoko
    Onishi, Masaki
    Oikawa, Yasuhiro
    2020 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2020, : 4848 - 4854