Cross-modal Manifold Cutmix for Self-supervised Video Representation Learning

Cited by: 1

Authors:

Das, Srijan [1]

Ryoo, Michael [2]

Affiliations:

[1] UNC Charlotte, Charlotte, NC 28223 USA

[2] SUNY Stony Brook, Stony Brook, NY USA

Source:

2023 18TH INTERNATIONAL CONFERENCE ON MACHINE VISION AND APPLICATIONS, MVA | 2023

DOI:

10.23919/MVA57639.2023.10216260

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

In this paper, we address the challenge of obtaining large-scale unlabelled video datasets for contrastive representation learning in real-world applications. We present a novel video augmentation technique for self-supervised learning, called Cross-Modal Manifold Cutmix (CMMC), which generates augmented samples by combining different modalities in videos. By embedding a video tesseract into another across two modalities in the feature space, our method enhances the quality of learned video representations. We perform extensive experiments on two small-scale video datasets, UCF101 and HMDB51, for action recognition and video retrieval tasks. Our approach is also shown to be effective on the NTU dataset with limited domain knowledge. Our CMMC achieves comparable performance to other self-supervised methods while using less training data for both downstream tasks.
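The core operation named in the abstract, embedding a video tesseract from one modality into another in feature space, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `(C, T, H, W)` feature layout, the choice of RGB and optical flow as the two modalities, and the way the box size is derived from a mixing ratio `lam` are all assumptions made for illustration.

```python
import numpy as np


def sample_tesseract(shape, lam, rng):
    """Sample a random (T, H, W) box covering roughly (1 - lam) of the volume.

    Hypothetical helper: the box-size rule is an assumption borrowed from
    image CutMix, extended to three dimensions.
    """
    t, h, w = shape
    ratio = (1.0 - lam) ** (1.0 / 3.0)  # equal shrink factor along each axis
    bt = max(1, int(round(t * ratio)))
    bh = max(1, int(round(h * ratio)))
    bw = max(1, int(round(w * ratio)))
    t0 = int(rng.integers(0, t - bt + 1))
    h0 = int(rng.integers(0, h - bh + 1))
    w0 = int(rng.integers(0, w - bw + 1))
    return (t0, t0 + bt), (h0, h0 + bh), (w0, w0 + bw)


def cross_modal_manifold_cutmix(feat_a, feat_b, lam, rng):
    """Paste a tesseract of feat_b into feat_a at the same location.

    feat_a: (C, T, H, W) intermediate features of one clip in modality A
            (assumed here to be RGB).
    feat_b: (C, T, H, W) intermediate features of another clip in modality B
            (assumed here to be optical flow).
    Returns the mixed features and the effective fraction kept from feat_a.
    """
    assert feat_a.shape == feat_b.shape, "feature maps must be aligned"
    _, t, h, w = feat_a.shape
    (t0, t1), (h0, h1), (w0, w1) = sample_tesseract((t, h, w), lam, rng)
    mixed = feat_a.copy()  # leave the input untouched
    mixed[:, t0:t1, h0:h1, w0:w1] = feat_b[:, t0:t1, h0:h1, w0:w1]
    box_fraction = (t1 - t0) * (h1 - h0) * (w1 - w0) / (t * h * w)
    return mixed, 1.0 - box_fraction


rng = np.random.default_rng(0)
rgb_feat = np.zeros((4, 8, 7, 7))   # stand-in for modality-A features
flow_feat = np.ones((4, 8, 7, 7))   # stand-in for modality-B features
mixed, lam_eff = cross_modal_manifold_cutmix(rgb_feat, flow_feat, 0.5, rng)
```

In this sketch the returned `lam_eff` is recomputed from the actual box size, so it can be used as the mixing weight in a downstream contrastive loss even when rounding makes the box differ from the requested ratio.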


Pages: 6

50 entries in total

[1] Self-Supervised Learning by Cross-Modal Audio-Video Clustering
Alwassel, Humam
Mahajan, Dhruv
Korbar, Bruno
Torresani, Lorenzo
Ghanem, Bernard
Tran, Du
[J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
[2] Self-Supervised Correlation Learning for Cross-Modal Retrieval
Liu, Yaxin
Wu, Jianlong
Qu, Leigang
Gan, Tian
Yin, Jianhua
Nie, Liqiang
[J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 2851 - 2863
[3] Cross-modal self-supervised representation learning for gesture and skill recognition in robotic surgery
Wu, Jie Ying
Tamhane, Aniruddha
Kazanzides, Peter
Unberath, Mathias
[J]. INTERNATIONAL JOURNAL OF COMPUTER ASSISTED RADIOLOGY AND SURGERY, 2021, 16 (05) : 779 - 787
[4] Cross-modal self-supervised representation learning for gesture and skill recognition in robotic surgery
Jie Ying Wu
Aniruddha Tamhane
Peter Kazanzides
Mathias Unberath
[J]. International Journal of Computer Assisted Radiology and Surgery, 2021, 16 : 779 - 787
[5] Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity
Sarkar, Pritam
Etemad, Ali
[J]. THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 8, 2023, : 9723 - 9732
[6] Trusted 3D self-supervised representation learning with cross-modal settings
Han, Xu
Cheng, Haozhe
Shi, Pengcheng
Zhu, Jihua
[J]. MACHINE VISION AND APPLICATIONS, 2024, 35 (04)
[7] SELF-SUPERVISED LEARNING WITH CROSS-MODAL TRANSFORMERS FOR EMOTION RECOGNITION
Khare, Aparna
Parthasarathy, Srinivas
Sundaram, Shiva
[J]. 2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 381 - 388
[8] Cross-Architecture Self-supervised Video Representation Learning
Guo, Sheng
Xiong, Zihua
Zhong, Yujie
Wang, Limin
Guo, Xiaobo
Han, Bing
Huang, Weilin
[J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 19248 - 19257
[9] CMD: Self-supervised 3D Action Representation Learning with Cross-Modal Mutual Distillation
Mao, Yunyao
Zhou, Wengang
Lu, Zhenbo
Deng, Jiajun
Li, Houqiang
[J]. COMPUTER VISION - ECCV 2022, PT III, 2022, 13663 : 734 - 752
[10] Self-supervised incomplete cross-modal hashing retrieval
Peng, Shouyong
Yao, Tao
Li, Ying
Wang, Gang
Wang, Lili
Yan, Zhiming
[J]. Expert Systems with Applications, 2025, 262
