Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity

Cited by: 0

Authors:

Sarkar, Pritam [1,2]

Etemad, Ali [1]

Affiliations:

[1] Queens Univ, Kingston, ON, Canada

[2] Vector Inst, Toronto, ON, Canada

Source:

THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 8 | 2023

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We present CrissCross, a self-supervised framework for learning audio-visual representations. Our framework introduces a novel notion: in addition to learning intra-modal and standard 'synchronous' cross-modal relations, CrissCross also learns 'asynchronous' cross-modal relations. We perform in-depth studies showing that by relaxing the temporal synchronicity between the audio and visual modalities, the network learns strong, generalized representations useful for a variety of downstream tasks. To pretrain our proposed solution, we use three datasets of varying sizes: Kinetics-Sound, Kinetics400, and AudioSet. The learned representations are evaluated on a number of downstream tasks, namely action recognition, sound classification, and action retrieval. Our experiments show that CrissCross either outperforms or performs on par with the current state-of-the-art self-supervised methods on action recognition and action retrieval with UCF101 and HMDB51, as well as sound classification with ESC50 and DCASE. Moreover, CrissCross outperforms fully-supervised pretraining when pretrained on Kinetics-Sound.

Pages: 9723-9732

Page count: 10

50 items in total

[1] Audio-Visual Predictive Coding for Self-Supervised Visual Representation Learning
Tellamekala, Mani Kumar
Valstar, Michel
Pound, Michael
Giesbrecht, Timo
[J]. 2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 9912 - 9919
[2] Comparing Learning Methodologies for Self-Supervised Audio-Visual Representation Learning
Terbouche, Hacene
Schoneveld, Liam
Benson, Oisin
Othmani, Alice
[J]. IEEE ACCESS, 2022, 10 : 41622 - 41638
[3] SCLAV: Supervised Cross-modal Contrastive Learning for Audio-Visual Coding
Sun, Chao
Chen, Min
Cheng, Jialiang
Liang, Han
Zhu, Chuanbo
Chen, Jincai
[J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 261 - 270
[4] Self-Supervised Audio-Visual Representation Learning for in-the-wild Videos
Feng, Zishun
Tu, Ming
Xia, Rui
Wang, Yuxuan
Krishnamurthy, Ashok
[J]. 2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2020, : 5671 - 5672
[5] Self-Supervised Learning by Cross-Modal Audio-Video Clustering
Alwassel, Humam
Mahajan, Dhruv
Korbar, Bruno
Torresani, Lorenzo
Ghanem, Bernard
Tran, Du
[J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
[6] Cross-modal Manifold Cutmix for Self-supervised Video Representation Learning
Das, Srijan
Ryoo, Michael
[J]. 2023 18TH INTERNATIONAL CONFERENCE ON MACHINE VISION AND APPLICATIONS, MVA, 2023,
[7] Cross-Modal learning for Audio-Visual Video Parsing
Lamba, Jatin
Abhishek
Akula, Jayaprakash
Dabral, Rishabh
Jyothi, Preethi
Ramakrishnan, Ganesh
[J]. INTERSPEECH 2021, 2021, : 1937 - 1941
[8] SELF-SUPERVISED LEARNING FOR AUDIO-VISUAL SPEAKER DIARIZATION
Ding, Yifan
Xu, Yong
Zhang, Shi-Xiong
Cong, Yahuan
Wang, Liqiang
[J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 4367 - 4371
[9] Self-Supervised Visual Representations for Cross-Modal Retrieval
Patel, Yash
Gomez, Lluis
Rusinol, Marcal
Karatzas, Dimosthenis
Jawahar, C. V.
[J]. ICMR'19: PROCEEDINGS OF THE 2019 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2019, : 182 - 186
[10] Self-Supervised Correlation Learning for Cross-Modal Retrieval
Liu, Yaxin
Wu, Jianlong
Qu, Leigang
Gan, Tian
Yin, Jianhua
Nie, Liqiang
[J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 2851 - 2863
