Long-Short Temporal Contrastive Learning of Video Transformers

Cited by: 12
Authors
Wang, Jue [1]
Bertasius, Gedas [2]
Tran, Du [1]
Torresani, Lorenzo [1,3]
Affiliations
[1] Facebook AI Research, Menlo Park, CA 94025, USA
[2] University of North Carolina, Chapel Hill, NC, USA
[3] Dartmouth College, Hanover, NH, USA
DOI: 10.1109/CVPR52688.2022.01362
Chinese Library Classification: TP18 (Artificial Intelligence Theory)
Discipline codes: 081104; 0812; 0835; 1405
Abstract
Video transformers have recently emerged as a competitive alternative to 3D CNNs for video understanding. However, due to their large number of parameters and reduced inductive biases, these models require supervised pretraining on large-scale image datasets to achieve top performance. In this paper, we empirically demonstrate that self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results that are on par with or better than those obtained with supervised pretraining on large-scale image datasets, even massive ones such as ImageNet-21K. Since transformer-based models are effective at capturing dependencies over extended temporal spans, we propose a simple learning procedure that forces the model to match a long-term view to a short-term view of the same video. Our approach, named Long-Short Temporal Contrastive Learning (LSTCL), enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent. To demonstrate the generality of our findings, we implement and validate our approach under three different self-supervised contrastive learning frameworks (MoCo v3, BYOL, SimSiam) using two distinct video-transformer architectures, including an improved variant of the Swin Transformer augmented with space-time attention. We conduct a thorough ablation study and show that LSTCL achieves competitive performance on multiple video benchmarks and represents a convincing alternative to supervised image-based pretraining.
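
To make the long-short pairing concrete, the sketch below illustrates the general idea in PyTorch: sample a short clip and a longer clip from the same video and pull their embeddings together with an InfoNCE-style objective of the kind used in MoCo v3. This is not the authors' implementation; the clip-sampling helper, the encoder placeholder, and all hyperparameters (clip lengths, stride, temperature) are illustrative assumptions.

import torch
import torch.nn.functional as F

def sample_long_short_clips(video, short_len=8, long_len=32, stride=2):
    # video: (C, T, H, W) tensor of decoded frames (illustrative layout).
    # Returns a short clip and a longer clip drawn from the same video,
    # mimicking the short-view / long-view pairing described in the abstract.
    T = video.shape[1]
    long_span = long_len * stride
    start_long = torch.randint(0, max(T - long_span, 1), (1,)).item()
    long_clip = video[:, start_long:start_long + long_span:stride]
    short_span = short_len * stride
    start_short = torch.randint(0, max(T - short_span, 1), (1,)).item()
    short_clip = video[:, start_short:start_short + short_span:stride]
    return short_clip, long_clip

def info_nce(q, k, temperature=0.1):
    # Standard InfoNCE over a batch: the short-view embedding of each video
    # should match the long-view embedding of the same video (diagonal positives).
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / temperature          # (B, B) cosine-similarity logits
    labels = torch.arange(q.shape[0], device=q.device)
    return F.cross_entropy(logits, labels)

# Usage sketch: `encoder` stands in for any video transformer mapping a clip
# batch (B, C, T, H, W) to clip-level embeddings (B, D); in MoCo v3 or BYOL the
# long view would typically be processed by a momentum/target copy of it.
#   short_emb = encoder(short_clips)
#   long_emb = momentum_encoder(long_clips)
#   loss = info_nce(short_emb, long_emb)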
Pages: 13990-14000
Page count: 11