Audio-Visual Predictive Coding for Self-Supervised Visual Representation Learning

Authors
Tellamekala, Mani Kumar [1]
Valstar, Michel [1]
Pound, Michael [1]
Giesbrecht, Timo [2]
Affiliations
[1] Univ Nottingham, Sch Comp Sci, Comp Vis Lab, Nottingham, England
[2] Unilever R&D Port Sunlight, Bebington, England
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK
DOI
10.1109/ICPR48806.2021.9413295
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Self-supervised learning has emerged as a candidate approach to learn semantic visual features from unlabeled video data. In self-supervised learning, intrinsic correspondences between data points are used to define a proxy task that forces the model to learn semantic representations. Most existing proxy tasks applied to video data exploit either intra-modal (e.g. temporal) or cross-modal (e.g. audio-visual) correspondences, but not both. In theory, jointly learning both these correspondences may result in richer visual features; but, as we show in this work, doing so is non-trivial in practice. To address this problem, we introduce 'Audio-Visual Permutative Predictive Coding' (AV-PPC), a multi-task learning framework designed to fully leverage the temporal and cross-modal correspondences as natural supervision signals. In AV-PPC, the model is trained to simultaneously learn multiple intra- and cross-modal predictive coding sub-tasks. By using visual speech recognition (lip-reading) as the downstream evaluation task, we show that our proposed proxy task can learn higher-quality visual features than existing proxy tasks. We also show that AV-PPC visual features are highly data-efficient. Without further fine-tuning, the AV-PPC visual encoder achieves an 80.30% spoken word classification rate on the LRW dataset, performing on par with directly supervised visual encoders that are learned from large amounts of labeled data.
Pages: 9912-9919
Page count: 8