Long-Short Temporal Contrastive Learning of Video Transformers

Cited by: 12
Authors
Wang, Jue [1 ]
Bertasius, Gedas [2 ]
Tran, Du [1 ]
Torresani, Lorenzo [1 ,3 ]
Affiliations
[1] Facebook AI Research, Menlo Park, CA 94025, USA
[2] University of North Carolina, Chapel Hill, NC, USA
[3] Dartmouth College, Hanover, NH, USA
DOI
10.1109/CVPR52688.2022.01362
CLC classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video transformers have recently emerged as a competitive alternative to 3D CNNs for video understanding. However, due to their large number of parameters and reduced inductive biases, these models require supervised pretraining on large-scale image datasets to achieve top performance. In this paper, we empirically demonstrate that self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results that are on par with or better than those obtained with supervised pretraining on large-scale image datasets, even massive ones such as ImageNet-21K. Since transformer-based models are effective at capturing dependencies over extended temporal spans, we propose a simple learning procedure that forces the model to match a long-term view to a short-term view of the same video. Our approach, named Long-Short Temporal Contrastive Learning (LSTCL), enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent. To demonstrate the generality of our findings, we implement and validate our approach under three different self-supervised contrastive learning frameworks (MoCo v3, BYOL, SimSiam) using two distinct video-transformer architectures, including an improved variant of the Swin Transformer augmented with space-time attention. We conduct a thorough ablation study and show that LSTCL achieves competitive performance on multiple video benchmarks and represents a convincing alternative to supervised image-based pretraining.
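The long-short matching objective described in the abstract can be sketched as a contrastive (InfoNCE-style, as in MoCo v3) loss between embeddings of a short clip and a longer clip drawn from the same video. The sketch below is a minimal illustration, not the authors' implementation: the sampler, function names, and hyperparameters (clip lengths, temperature) are assumptions.

```python
import numpy as np

def sample_views(num_frames, short_len=8, long_len=32, rng=None):
    """Sample frame indices for a short view and a longer view of one video.
    A simple uniform-random sampler; the paper's exact sampling strategy may differ."""
    rng = rng if rng is not None else np.random.default_rng(0)
    s0 = int(rng.integers(0, num_frames - short_len + 1))
    l0 = int(rng.integers(0, num_frames - long_len + 1))
    return np.arange(s0, s0 + short_len), np.arange(l0, l0 + long_len)

def info_nce(z_short, z_long, tau=0.1):
    """InfoNCE loss: each short-view embedding should match the long-view
    embedding of the same video (the diagonal) against the other videos
    in the batch. z_short, z_long: (batch, dim) L2-normalized embeddings."""
    logits = z_short @ z_long.T / tau                  # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))          # positives on the diagonal
```

In an actual training loop, the two sampled views would be passed through the video transformer (and, for MoCo v3 or BYOL, a momentum encoder) to produce the embeddings; here the normalized vectors stand in for those outputs.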
Pages: 13990 - 14000 (11 pages)
Related Papers (50 total)
  • [31] Compressed Video Contrastive Learning
    Huo, Yuqi
    Ding, Mingyu
    Lu, Haoyu
    Fei, Nanyi
    Lu, Zhiwu
    Wen, Ji-Rong
    Luo, Ping
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [32] TCGL: Temporal Contrastive Graph for Self-Supervised Video Representation Learning
    Liu, Yang
    Wang, Keze
    Liu, Lingbo
    Lan, Haoyuan
    Lin, Liang
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31 : 1978 - 1993
  • [33] Practical investment with the long-short game
    Al-baghdadi, Najim
    Kalnishkan, Yuri
    Lindsay, David
    Lindsay, Sian
    ANNALS OF MATHEMATICS AND ARTIFICIAL INTELLIGENCE, 2023,
  • [34] The Case for Long-Short Commodity Investing
    Miffre, Joelle
    Fernandez-Perez, Adrian
    JOURNAL OF ALTERNATIVE INVESTMENTS, 2015, 18 (01): : 92 - 104
  • [35] A Long-Short Term Memory Neural Network Based Rate Control Method for Video Coding
    Zhang, Zheng-Teng
    Lin, Jucai
    Fang, Ruidong
    Lu, Juan
    Chen, Yao
    PROCEEDINGS OF 2018 THE 2ND INTERNATIONAL CONFERENCE ON VIDEO AND IMAGE PROCESSING (ICVIP 2018), 2018, : 155 - 160
  • [36] Temporal Contrastive Pretraining for Video Action Recognition
    Lorre, Guillaume
    Rabarisoa, Jaonary
    Orcesi, Astrid
    Ainouz, Samia
    Canu, Stephane
    2020 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2020, : 651 - 659
  • [38] Cross-View Temporal Contrastive Learning for Self-Supervised Video Representation
    Wang, Lulu
    Xu, Zengmin
    Zhang, Xuelian
    Meng, Ruxing
    Lu, Tao
COMPUTER ENGINEERING AND APPLICATIONS, 2024, 60 (18) : 158 - 166
  • [39] TCKGE: Transformers with contrastive learning for knowledge graph embedding
    Zhang, Xiaowei
    Fang, Quan
    Hu, Jun
    Qian, Shengsheng
    Xu, Changsheng
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2022, 11 (04) : 589 - 597
  • [40] Attentive spatial-temporal contrastive learning for self-supervised video representation
    Yang, Xingming
    Xiong, Sixuan
    Wu, Kewei
    Shan, Dongfeng
    Xie, Zhao
    IMAGE AND VISION COMPUTING, 2023, 137