ViViT: A Video Vision Transformer

Cited by: 864
Authors
Arnab, Anurag [1]
Dehghani, Mostafa [1]
Heigold, Georg [1]
Sun, Chen [1]
Lucic, Mario [1]
Schmid, Cordelia [1]
Affiliation
[1] Google Res, Mountain View, CA 94043 USA
DOI
10.1109/ICCV48922.2021.00676
CLC number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers. In order to handle the long sequences of tokens encountered in video, we propose several efficient variants of our model which factorise the spatial and temporal dimensions of the input. Although transformer-based models are known to only be effective when large training datasets are available, we show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple video classification benchmarks including Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time, outperforming prior methods based on deep 3D convolutional networks.
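The abstract does not say how the spatio-temporal tokens are extracted or how the factorisation is done. The sketch below is one plausible reading, not the authors' implementation: it cuts a video into non-overlapping t x h x w patches and compares the number of query-key pairs under joint attention with a scheme that attends over space and time separately. The function names, the patch size (2 x 16 x 16) and the clip size (32 x 224 x 224) are all assumptions chosen for illustration.

```python
import numpy as np


def extract_tubelet_tokens(video, t=2, h=16, w=16):
    """Split a (T, H, W, C) video into flattened, non-overlapping t x h x w patches.

    Hypothetical helper: patch sizes are illustrative, not taken from the paper.
    Returns the token matrix and the (nt, nh, nw) token grid.
    """
    T, H, W, C = video.shape
    if T % t or H % h or W % w:
        raise ValueError("video dimensions must be divisible by the patch size")
    nt, nh, nw = T // t, H // h, W // w
    # Separate each axis into (number of patches, patch extent).
    x = video.reshape(nt, t, nh, h, nw, w, C)
    # Bring the three patch-index axes to the front.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    tokens = x.reshape(nt * nh * nw, t * h * w * C)
    return tokens, (nt, nh, nw)


def attention_pair_counts(nt, nh, nw):
    """Count query-key pairs for joint vs. space/time-factorised attention.

    Assumed factorisation: spatial attention inside each temporal slice,
    then temporal attention at each spatial location.
    """
    n_space = nh * nw
    joint = (nt * n_space) ** 2
    factorised = nt * n_space ** 2 + n_space * nt ** 2
    return joint, factorised


video = np.zeros((32, 224, 224, 3), dtype=np.float32)  # assumed clip size
tokens, grid = extract_tubelet_tokens(video)
joint, factorised = attention_pair_counts(*grid)
print(tokens.shape, grid, joint, factorised)
```

Under these assumed sizes the factorised scheme touches far fewer query-key pairs than joint attention over all tokens, which is the kind of saving the abstract alludes to when it mentions long token sequences.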
Pages: 6816-6826
Page count: 11