Temporal Shift Module-Based Vision Transformer Network for Action Recognition

Cited by: 1
Authors
Zhang, Kunpeng [1 ]
Lyu, Mengyan [1 ]
Guo, Xinxin [1 ]
Zhang, Liye [1 ]
Liu, Cong [1 ]
Affiliations
[1] Shandong Univ Technol, Coll Comp Sci & Technol, Zibo 255000, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Transformers; Computational modeling; Convolutional neural networks; Computer architecture; Task analysis; Image segmentation; Head; Action recognition; self-attention; temporal shift module; vision transformer;
DOI
10.1109/ACCESS.2024.3379885
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
This paper introduces a novel action recognition model named ViT-Shift, which combines the Temporal Shift Module (TSM) with the Vision Transformer (ViT) architecture. Conventional video action recognition models face significant computational challenges and require substantial computing resources. Our model addresses this issue by incorporating the TSM, achieving strong performance while significantly reducing computational cost. The approach applies the Transformer self-attention mechanism to video sequence processing in place of traditional convolutional methods. To preserve the core architecture of ViT and transfer its excellent image recognition performance to video action recognition, we insert the TSM only before the multi-head attention layer of ViT. This design simulates temporal interaction through channel shifts, effectively reducing computational complexity, and the position and shift parameters of the TSM are carefully chosen to maximize performance. Experimental results show that ViT-Shift achieves strong results on two standard action recognition datasets: with ImageNet-21K pretraining, it reaches an accuracy of 77.55% on Kinetics-400 and 93.07% on UCF-101.
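A minimal sketch of the design described in the abstract, assuming a PyTorch-style implementation: a fraction of the token channels is shifted along the temporal axis immediately before the multi-head attention layer of a standard pre-norm ViT block, leaving the rest of the block (and any pretrained weights) untouched. The class names, the 1/8 shift fraction, and the default dimensions are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


def temporal_shift(x, num_frames, shift_div=8):
    """Shift a fraction of channels along the temporal axis.

    x: (batch * num_frames, num_tokens, dim) token embeddings.
    The 1/shift_div split into forward/backward shifted channels is an
    assumption borrowed from the original TSM formulation.
    """
    bt, n, c = x.shape
    b = bt // num_frames
    x = x.view(b, num_frames, n, c)
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :, :fold] = x[:, :-1, :, :fold]                   # shift forward in time
    out[:, :-1, :, fold:2 * fold] = x[:, 1:, :, fold:2 * fold]   # shift backward in time
    out[:, :, :, 2 * fold:] = x[:, :, :, 2 * fold:]              # remaining channels unchanged
    return out.view(bt, n, c)


class ShiftedViTBlock(nn.Module):
    """Pre-norm ViT block with a temporal shift applied only before self-attention."""

    def __init__(self, dim=768, heads=12, num_frames=8, mlp_ratio=4.0):
        super().__init__()
        self.num_frames = num_frames
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)), nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):
        # Temporal interaction is injected here via channel shifts; the attention
        # and MLP sub-layers themselves are unmodified ViT components.
        shifted = temporal_shift(self.norm1(x), self.num_frames)
        x = x + self.attn(shifted, shifted, shifted, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x


# Example: a batch of 2 clips, 8 frames each, 197 patch tokens of width 768.
tokens = torch.randn(2 * 8, 197, 768)
out = ShiftedViTBlock(num_frames=8)(tokens)
```

Because the shift is a zero-parameter memory operation, it adds temporal interaction without enlarging the attention layer itself, which matches the reduced-cost motivation stated in the abstract.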
Pages: 47246-47257
Page count: 12