Cross-modal Manifold Cutmix for Self-supervised Video Representation Learning

Cited by: 1
Authors
Das, Srijan [1]
Ryoo, Michael [2]
Affiliations
[1] UNC Charlotte, Charlotte, NC 28223 USA
[2] SUNY Stony Brook, Stony Brook, NY USA
DOI
10.23919/MVA57639.2023.10216260
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we address the challenge of obtaining large-scale unlabelled video datasets for contrastive representation learning in real-world applications. We present a novel video augmentation technique for self-supervised learning, called Cross-Modal Manifold Cutmix (CMMC), which generates augmented samples by combining different modalities in videos. By embedding a video tesseract into another across two modalities in the feature space, our method enhances the quality of learned video representations. We perform extensive experiments on two small-scale video datasets, UCF101 and HMDB51, for action recognition and video retrieval tasks. Our approach is also shown to be effective on the NTU dataset with limited domain knowledge. Our CMMC achieves comparable performance to other self-supervised methods while using less training data for both downstream tasks.
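The idea of embedding one video tesseract into another in feature space can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the NumPy feature layout `(C, T, H, W)`, the cube-root box sizing, and the choice of which modality is pasted into which are all assumptions made for illustration only.

```python
import numpy as np


def sample_tesseract(shape, lam, rng):
    """Pick a random (t, h, w) box covering roughly a (1 - lam) share of the volume.

    Hypothetical helper: the cube-root sizing rule is an assumption borrowed
    from common CutMix-style augmentations, not taken from the paper.
    """
    ratio = (1.0 - lam) ** (1.0 / 3.0)  # per-axis side ratio
    sizes = [max(1, int(round(dim * ratio))) for dim in shape]
    starts = [int(rng.integers(0, dim - size + 1)) for dim, size in zip(shape, sizes)]
    return tuple(slice(s, s + n) for s, n in zip(starts, sizes))


def cross_modal_manifold_cutmix(feat_a, feat_b, lam, rng):
    """Paste a tesseract of modality-B features into modality-A features.

    feat_a, feat_b: intermediate feature maps of identical shape (C, T, H, W),
    assumed to come from two different modalities of the same or another video.
    Returns the mixed features and the effective share kept from feat_a.
    """
    if feat_a.shape != feat_b.shape:
        raise ValueError("feature maps must have the same shape")
    t, h, w = sample_tesseract(feat_a.shape[1:], lam, rng)
    mixed = feat_a.copy()
    mixed[:, t, h, w] = feat_b[:, t, h, w]  # replace the box across all channels
    box_volume = (t.stop - t.start) * (h.stop - h.start) * (w.stop - w.start)
    lam_eff = 1.0 - box_volume / float(np.prod(feat_a.shape[1:]))
    return mixed, lam_eff


# Usage with placeholder features (zeros for modality A, ones for modality B).
rng = np.random.default_rng(0)
feat_a = np.zeros((8, 4, 7, 7))
feat_b = np.ones((8, 4, 7, 7))
mixed, lam_eff = cross_modal_manifold_cutmix(feat_a, feat_b, lam=0.5, rng=rng)
```

In a contrastive setup, a value such as `lam_eff` would typically weight how strongly the mixed sample is treated as a positive for each source; how the paper handles this is not stated in the abstract.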
Pages: 6
Related papers
50 in total
  • [1] Self-Supervised Learning by Cross-Modal Audio-Video Clustering
    Alwassel, Humam
    Mahajan, Dhruv
    Korbar, Bruno
    Torresani, Lorenzo
    Ghanem, Bernard
    Tran, Du
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [2] Self-Supervised Correlation Learning for Cross-Modal Retrieval
    Liu, Yaxin
    Wu, Jianlong
    Qu, Leigang
    Gan, Tian
    Yin, Jianhua
    Nie, Liqiang
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 2851 - 2863
  • [3] Cross-modal self-supervised representation learning for gesture and skill recognition in robotic surgery
    Wu, Jie Ying
    Tamhane, Aniruddha
    Kazanzides, Peter
    Unberath, Mathias
    [J]. INTERNATIONAL JOURNAL OF COMPUTER ASSISTED RADIOLOGY AND SURGERY, 2021, 16 (05) : 779 - 787
  • [4] Cross-modal self-supervised representation learning for gesture and skill recognition in robotic surgery
    Jie Ying Wu
    Aniruddha Tamhane
    Peter Kazanzides
    Mathias Unberath
    [J]. International Journal of Computer Assisted Radiology and Surgery, 2021, 16 : 779 - 787
  • [5] Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity
    Sarkar, Pritam
    Etemad, Ali
    [J]. THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 8, 2023, : 9723 - 9732
  • [6] Trusted 3D self-supervised representation learning with cross-modal settings
    Han, Xu
    Cheng, Haozhe
    Shi, Pengcheng
    Zhu, Jihua
    [J]. MACHINE VISION AND APPLICATIONS, 2024, 35 (04)
  • [7] SELF-SUPERVISED LEARNING WITH CROSS-MODAL TRANSFORMERS FOR EMOTION RECOGNITION
    Khare, Aparna
    Parthasarathy, Srinivas
    Sundaram, Shiva
    [J]. 2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 381 - 388
  • [8] Cross-Architecture Self-supervised Video Representation Learning
    Guo, Sheng
    Xiong, Zihua
    Zhong, Yujie
    Wang, Limin
    Guo, Xiaobo
    Han, Bing
    Huang, Weilin
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 19248 - 19257
  • [9] CMD: Self-supervised 3D Action Representation Learning with Cross-Modal Mutual Distillation
    Mao, Yunyao
    Zhou, Wengang
    Lu, Zhenbo
    Deng, Jiajun
    Li, Houqiang
    [J]. COMPUTER VISION - ECCV 2022, PT III, 2022, 13663 : 734 - 752
  • [10] Self-supervised incomplete cross-modal hashing retrieval
    Peng, Shouyong
    Yao, Tao
    Li, Ying
    Wang, Gang
    Wang, Lili
    Yan, Zhiming
    [J]. Expert Systems with Applications, 2025, 262