Audio-Visual Predictive Coding for Self-Supervised Visual Representation Learning

Cited by: 0

Authors:

Tellamekala, Mani Kumar ^{[1]}

Valstar, Michel ^{[1]}

Pound, Michael ^{[1]}

Giesbrecht, Timo ^{[2]}

Affiliations:

[1] Univ Nottingham, Sch Comp Sci, Comp Vis Lab, Nottingham, England

[2] Unilever R&D Port Sunlight, Bebington, England

Source:

2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR) | 2021

Funding:

UK Engineering and Physical Sciences Research Council (EPSRC);

Keywords:

DOI:

10.1109/ICPR48806.2021.9413295

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Self-supervised learning has emerged as a candidate approach to learn semantic visual features from unlabeled video data. In self-supervised learning, intrinsic correspondences between data points are used to define a proxy task that forces the model to learn semantic representations. Most existing proxy tasks applied to video data exploit either intra-modal (e.g. temporal) or cross-modal (e.g. audio-visual) correspondences, but not both. In theory, jointly learning both these correspondences may result in richer visual features; but, as we show in this work, doing so is non-trivial in practice. To address this problem, we introduce 'Audio-Visual Permutative Predictive Coding' (AV-PPC), a multi-task learning framework designed to fully leverage the temporal and cross-modal correspondences as natural supervision signals. In AV-PPC, the model is trained to simultaneously learn multiple intra- and cross-modal predictive coding sub-tasks. By using visual speech recognition (lip-reading) as the downstream evaluation task, we show that our proposed proxy task can learn higher quality visual features than existing proxy tasks. We also show that AV-PPC visual features are highly data-efficient. Without further fine-tuning, the AV-PPC visual encoder achieves an 80.30% spoken word classification rate on the LRW dataset, performing on par with directly supervised visual encoders that are learned from large amounts of labeled data.


Pages: 9912-9919

Number of pages: 8

