Space-time Mixing Attention for Video Transformer

Cited by: 0
|
Authors
Bulat, Adrian [1 ]
Perez-Rua, Juan-Manuel [1 ]
Sudhakaran, Swathikiran [1 ]
Martinez, Brais [1 ]
Tzimiropoulos, Georgios [1 ,2 ]
Affiliations
[1] Samsung AI Cambridge, Cambridge, England
[2] Queen Mary Univ London, London, England
Keywords: (none)
DOI: not available
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline Codes: 081104; 0812; 0835; 1405
Abstract
This paper is on video recognition using Transformers. Very recent attempts in this area have demonstrated promising results in terms of recognition accuracy, yet they have also been shown to induce, in many cases, significant computational overheads due to the additional modelling of the temporal information. In this work, we propose a Video Transformer model whose complexity scales linearly with the number of frames in the video sequence and hence induces no overhead compared to an image-based Transformer model. To achieve this, our model makes two approximations to the full space-time attention used in Video Transformers: (a) it restricts time attention to a local temporal window and capitalizes on the Transformer's depth to obtain full temporal coverage of the video sequence; (b) it uses efficient space-time mixing to jointly attend to spatial and temporal locations without inducing any additional cost on top of a spatial-only attention model. We also show how to integrate two very lightweight mechanisms for global temporal-only attention which provide additional accuracy improvements at minimal computational cost. We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets while at the same time being significantly more efficient than other Video Transformer models. Code for our method is made available here.
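The core idea in the abstract can be illustrated with a minimal NumPy sketch: keys and values for each frame borrow channel groups from frames in a local temporal window, so the attention matrix stays spatial-only (N x N per frame, linear in the number of frames) while each output token still receives temporal context. This is our own single-head simplification, not the authors' implementation; the function name, the channel-partition scheme, and the omission of learned query/key/value projections are all assumptions for illustration.

```python
import numpy as np

def softmax(a, axis=-1):
    # Numerically stable softmax.
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def space_time_mixing_attention(x, t_win=1):
    # x: (T, N, D) video tokens -- T frames, N spatial tokens, D channels.
    # Keys/values for frame t take channel groups from frames in
    # [t - t_win, t + t_win]; attention is then computed over the N
    # spatial tokens only, so cost matches spatial-only attention.
    T, N, D = x.shape
    n_off = 2 * t_win + 1
    # Channel-group boundaries: one group per temporal offset.
    bounds = [round(i * D / n_off) for i in range(n_off + 1)]
    out = np.empty_like(x)
    for t in range(T):
        kv = np.empty((N, D))
        for i, dt in enumerate(range(-t_win, t_win + 1)):
            src = min(max(t + dt, 0), T - 1)  # clamp at clip boundaries
            kv[:, bounds[i]:bounds[i + 1]] = x[src, :, bounds[i]:bounds[i + 1]]
        attn = softmax(x[t] @ kv.T / np.sqrt(D))  # (N, N), spatial-only
        out[t] = attn @ kv
    return out

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 16, 12))  # 8 frames, 16 tokens, 12 channels
y = space_time_mixing_attention(tokens)
print(y.shape)  # (8, 16, 12)
```

With `t_win=0` the mixing degenerates to plain per-frame spatial attention, which makes the claimed "no overhead over an image-based model" property easy to see: the temporal mixing only re-routes existing channels, it never enlarges the attention matrix.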
Pages: 14