Long-Short Temporal Contrastive Learning of Video Transformers

Cited by: 12
Authors
Wang, Jue [1 ]
Bertasius, Gedas [2 ]
Tran, Du [1 ]
Torresani, Lorenzo [1 ,3 ]
Affiliations
[1] Facebook AI Research, Menlo Park, CA 94025, USA
[2] University of North Carolina, Chapel Hill, NC, USA
[3] Dartmouth College, Hanover, NH, USA
DOI
10.1109/CVPR52688.2022.01362
CLC classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video transformers have recently emerged as a competitive alternative to 3D CNNs for video understanding. However, due to their large number of parameters and reduced inductive biases, these models require supervised pretraining on large-scale image datasets to achieve top performance. In this paper, we empirically demonstrate that self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results that are on par with or better than those obtained with supervised pretraining on large-scale image datasets, even massive ones such as ImageNet-21K. Since transformer-based models are effective at capturing dependencies over extended temporal spans, we propose a simple learning procedure that forces the model to match a long-term view to a short-term view of the same video. Our approach, named Long-Short Temporal Contrastive Learning (LSTCL), enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent. To demonstrate the generality of our findings, we implement and validate our approach under three different self-supervised contrastive learning frameworks (MoCo v3, BYOL, SimSiam) using two distinct video-transformer architectures, including an improved variant of the Swin Transformer augmented with space-time attention. We conduct a thorough ablation study and show that LSTCL achieves competitive performance on multiple video benchmarks and represents a convincing alternative to supervised image-based pretraining.
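The long-short matching objective described in the abstract can be sketched as a contrastive (InfoNCE-style, as in MoCo v3) loss between embeddings of a short clip and a longer clip drawn from the same video. The sketch below is a minimal illustration, not the authors' implementation: the sampler, function names, and hyperparameters (clip lengths, temperature) are assumptions.

```python
import numpy as np

def sample_views(num_frames, short_len=8, long_len=32, rng=None):
    """Sample frame indices for a short view and a longer view of one video.
    A simple uniform-random sampler; the paper's exact sampling strategy may differ."""
    rng = rng if rng is not None else np.random.default_rng(0)
    s0 = int(rng.integers(0, num_frames - short_len + 1))
    l0 = int(rng.integers(0, num_frames - long_len + 1))
    return np.arange(s0, s0 + short_len), np.arange(l0, l0 + long_len)

def info_nce(z_short, z_long, tau=0.1):
    """InfoNCE loss: each short-view embedding should match the long-view
    embedding of the same video (the diagonal) against the other videos
    in the batch. z_short, z_long: (batch, dim) L2-normalized embeddings."""
    logits = z_short @ z_long.T / tau                  # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))          # positives on the diagonal
```

In an actual training loop, the two sampled views would be passed through the video transformer (and, for MoCo v3 or BYOL, a momentum encoder) to produce the embeddings; here the normalized vectors stand in for those outputs.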
Pages: 13990 - 14000 (11 pages)
Related Papers (50 total)
  • [31] Compressed Video Contrastive Learning
    Huo, Yuqi
    Ding, Mingyu
    Lu, Haoyu
    Fei, Nanyi
    Lu, Zhiwu
    Wen, Ji-Rong
    Luo, Ping
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [32] TCGL: Temporal Contrastive Graph for Self-Supervised Video Representation Learning
    Liu, Yang
    Wang, Keze
    Liu, Lingbo
    Lan, Haoyuan
    Lin, Liang
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31 : 1978 - 1993
  • [33] Practical investment with the long-short game
    Al-baghdadi, Najim
    Kalnishkan, Yuri
    Lindsay, David
    Lindsay, Sian
    ANNALS OF MATHEMATICS AND ARTIFICIAL INTELLIGENCE, 2023,
  • [34] The Case for Long-Short Commodity Investing
    Miffre, Joelle
    Fernandez-Perez, Adrian
    JOURNAL OF ALTERNATIVE INVESTMENTS, 2015, 18 (01): : 92 - 104
  • [35] A Long-Short Term Memory Neural Network Based Rate Control Method for Video Coding
    Zhang, Zheng-Teng
    Lin, Jucai
    Fang, Ruidong
    Lu, Juan
    Chen, Yao
    PROCEEDINGS OF 2018 THE 2ND INTERNATIONAL CONFERENCE ON VIDEO AND IMAGE PROCESSING (ICVIP 2018), 2018, : 155 - 160
  • [36] Temporal Contrastive Pretraining for Video Action Recognition
    Lorre, Guillaume
    Rabarisoa, Jaonary
    Orcesi, Astrid
    Ainouz, Samia
    Canu, Stephane
    2020 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2020, : 651 - 659
  • [38] Cross-View Temporal Contrastive Learning for Self-Supervised Video Representation
    Wang, Lulu
    Xu, Zengmin
    Zhang, Xuelian
    Meng, Ruxing
    Lu, Tao
COMPUTER ENGINEERING AND APPLICATIONS, 2024, 60 (18) : 158 - 166
  • [39] TCKGE: Transformers with contrastive learning for knowledge graph embedding
    Zhang, Xiaowei
    Fang, Quan
    Hu, Jun
    Qian, Shengsheng
    Xu, Changsheng
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2022, 11 (04) : 589 - 597
  • [40] Attentive spatial-temporal contrastive learning for self-supervised video representation
    Yang, Xingming
    Xiong, Sixuan
    Wu, Kewei
    Shan, Dongfeng
    Xie, Zhao
    IMAGE AND VISION COMPUTING, 2023, 137