Long-Short Temporal Contrastive Learning of Video Transformers

Cited by: 12
Authors
Wang, Jue [1]
Bertasius, Gedas [2]
Tran, Du [1]
Torresani, Lorenzo [1, 3]
Affiliations
[1] Facebook AI Res, Menlo Pk, CA 94025 USA
[2] Univ N Carolina, Chapel Hill, NC USA
[3] Dartmouth, Hanover, NH USA
Keywords
DOI
10.1109/CVPR52688.2022.01362
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video transformers have recently emerged as a competitive alternative to 3D CNNs for video understanding. However, due to their large number of parameters and reduced inductive biases, these models require supervised pretraining on large-scale image datasets to achieve top performance. In this paper, we empirically demonstrate that self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results that are on par with or better than those obtained with supervised pretraining on large-scale image datasets, even massive ones such as ImageNet-21K. Since transformer-based models are effective at capturing dependencies over extended temporal spans, we propose a simple learning procedure that forces the model to match a long-term view to a short-term view of the same video. Our approach, named Long-Short Temporal Contrastive Learning (LSTCL), enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent. To demonstrate the generality of our findings, we implement and validate our approach under three different self-supervised contrastive learning frameworks (MoCo v3, BYOL, SimSiam) using two distinct video-transformer architectures, including an improved variant of the Swin Transformer augmented with space-time attention. We conduct a thorough ablation study and show that LSTCL achieves competitive performance on multiple video benchmarks and represents a convincing alternative to supervised image-based pretraining.
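The long-short matching idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`sample_views`, `info_nce`), the frame counts, the strides, and the temperature are all hypothetical choices made for illustration, and the loss is a generic InfoNCE-style objective of the kind used by contrastive frameworks such as MoCo v3. The sketch assumes each video yields one short-view embedding and one long-view embedding from some encoder, which is not shown.

```python
import math
import random


def sample_views(num_frames, frames_per_clip=8, short_stride=4, long_stride=16, rng=None):
    """Return (short_idx, long_idx): frame indices of two clips from one video.

    Both clips have the same number of frames; the long view uses a larger
    stride, so it covers a longer temporal extent (illustrative values only).
    """
    rng = rng or random.Random()

    def clip(stride):
        span = (frames_per_clip - 1) * stride + 1
        start = rng.randint(0, max(num_frames - span, 0))
        # Clamp to the last frame if the video is shorter than the span.
        return [min(start + i * stride, num_frames - 1) for i in range(frames_per_clip)]

    return clip(short_stride), clip(long_stride)


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(short_embs, long_embs, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    short_embs[i] should match long_embs[i] (same video); the long views of
    the other videos in the batch act as negatives.
    """
    q = [l2_normalize(v) for v in short_embs]
    k = [l2_normalize(v) for v in long_embs]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)
```

In this sketch the loss is small when each short-view embedding is closest to the long-view embedding of its own video and large when the pairing is wrong, which is the behavior a long-to-short matching objective is meant to encourage.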
Pages: 13990-14000
Page count: 11