Space-time Mixing Attention for Video Transformer

Cited by: 0
Authors
Bulat, Adrian [1 ]
Perez-Rua, Juan-Manuel [1 ]
Sudhakaran, Swathikiran [1 ]
Martinez, Brais [1 ]
Tzimiropoulos, Georgios [1 ,2 ]
Affiliations
[1] Samsung AI Cambridge, Cambridge, England
[2] Queen Mary Univ London, London, England
Keywords
DOI
Not available
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper is on video recognition using Transformers. Very recent attempts in this area have demonstrated promising results in terms of recognition accuracy, yet they have also been shown to induce, in many cases, significant computational overheads due to the additional modelling of the temporal information. In this work, we propose a Video Transformer model whose complexity scales linearly with the number of frames in the video sequence and hence induces no overhead compared to an image-based Transformer model. To achieve this, our model makes two approximations to the full space-time attention used in Video Transformers: (a) it restricts time attention to a local temporal window and capitalizes on the Transformer's depth to obtain full temporal coverage of the video sequence; (b) it uses efficient space-time mixing to jointly attend to spatial and temporal locations without inducing any additional cost on top of a spatial-only attention model. We also show how to integrate two very lightweight mechanisms for global temporal-only attention, which provide additional accuracy improvements at minimal computational cost. We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets while at the same time being significantly more efficient than other Video Transformer models. Code for our method is made available here.
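As a concrete illustration of approximations (a) and (b), the following is a minimal PyTorch sketch of the idea described in the abstract: keys and values borrow a fraction of their channels from adjacent frames (a local temporal window of size 3), after which attention is computed within each frame only, so the cost stays linear in the number of frames. The tensor layout, channel fractions, and function names are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of space-time mixing attention (assumptions: layout (B, T, N, C),
# a +/-1 frame window, and a 1/4 channel fraction per neighbouring frame).
import torch


def temporal_channel_shift(x, fold_div=4):
    """Mix channels across a +/-1 frame window.

    x: (B, T, N, C) -- batch, frames, spatial tokens, channels.
    The first C/fold_div channels come from frame t-1, the next
    C/fold_div from frame t+1, the rest stay at frame t.
    """
    B, T, N, C = x.shape
    fold = C // fold_div
    out = x.clone()
    out[:, 1:, :, :fold] = x[:, :-1, :, :fold]                  # from previous frame
    out[:, :-1, :, fold:2 * fold] = x[:, 1:, :, fold:2 * fold]  # from next frame
    return out


def space_time_mixing_attention(q, k, v, num_heads=8):
    """Spatial-only attention over time-mixed keys/values.

    q, k, v: (B, T, N, C). Complexity is O(T * N^2), i.e. linear in T,
    versus O((T * N)^2) for full space-time attention.
    """
    B, T, N, C = q.shape
    k = temporal_channel_shift(k)
    v = temporal_channel_shift(v)

    # Fold time into the batch so attention runs within each frame only.
    def heads(t):
        return t.reshape(B * T, N, num_heads, C // num_heads).transpose(1, 2)

    q, k, v = heads(q), heads(k), heads(v)
    attn = (q @ k.transpose(-2, -1)) * (C // num_heads) ** -0.5
    out = attn.softmax(dim=-1) @ v
    return out.transpose(1, 2).reshape(B, T, N, C)


# Example: 2 clips of 8 frames, 196 tokens (14x14 patches), 512 channels.
x = torch.randn(2, 8, 196, 512)
y = space_time_mixing_attention(x, x, x)
print(y.shape)  # torch.Size([2, 8, 196, 512])
```

Because the temporal mixing is a pure indexing operation, each such layer sees one frame of context on either side; stacking L layers extends the effective temporal receptive field to 2L+1 frames, which is how the Transformer's depth yields full temporal coverage of the sequence.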
Pages: 14
Related Papers
50 records in total
  • [1] CapFormer: A Space-Time Video Description Model using Joint-Attention Transformer
    Moussa, Mahamat
    Lim, Chern Hong
    Wong, KokSheik
    [J]. 2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 759 - 764
  • [2] Mixing space-time derivatives for video compressive sensing
    Yang, Yi
    Schaeffer, Hayden
    Yin, Wotao
    Osher, Stanley
    [J]. 2013 ASILOMAR CONFERENCE ON SIGNALS, SYSTEMS AND COMPUTERS, 2013, : 158 - 162
  • [3] STARVQA: SPACE-TIME ATTENTION FOR VIDEO QUALITY ASSESSMENT
    Xing, Fengchuang
    Wang, Yuan-Gen
    Wang, Hanpin
    Li, Leida
    Zhu, Guopu
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 2326 - 2330
  • [4] Is Space-Time Attention All You Need for Video Understanding?
    Bertasius, Gedas
    Wang, Heng
    Torresani, Lorenzo
    [J]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139, 2021, 139
  • [5] Space-Time Video Super-Resolution 3D Transformer
    Zheng, Minyan
    Luo, Jianping
    [J]. MULTIMEDIA MODELING, MMM 2023, PT II, 2023, 13834 : 374 - 385
  • [6] Space-time completion of video
    Wexler, Yonatan
    Shechtman, Eli
    Irani, Michal
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2007, 29 (03) : 463 - 476
  • [7] Space-time video completion
    Wexler, Y
    Shechtman, E
    Irani, M
    [J]. PROCEEDINGS OF THE 2004 IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, VOL 1, 2004, : 120 - 127
  • [8] STDAN: Deformable Attention Network for Space-Time Video Super-Resolution
    Wang, Hai
    Xiang, Xiaoyu
    Tian, Yapeng
    Yang, Wenming
    Liao, Qingmin
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (08) : 10606 - 10616
  • [9] Video summarization network based on Space-Time attention and genetic algorithm optimization
    Ao, Naixiang
    Shi, Fucheng
    [J]. PROCEEDINGS OF 2024 3RD INTERNATIONAL CONFERENCE ON CYBER SECURITY, ARTIFICIAL INTELLIGENCE AND DIGITAL ECONOMY, CSAIDE 2024, 2024, : 420 - 425
  • [10] STDAN: Deformable Attention Network for Space-Time Video Super-Resolution
    Wang, Hai
    Xiang, Xiaoyu
    Tian, Yapeng
    Yang, Wenming
    Liao, Qingmin
    [J]. arXiv, 2022,