Temporal Shift Module-Based Vision Transformer Network for Action Recognition

Cited by: 1
Authors
Zhang, Kunpeng [1]
Lyu, Mengyan [1]
Guo, Xinxin [1]
Zhang, Liye [1]
Liu, Cong [1]
Affiliations
[1] Shandong Univ Technol, Coll Comp Sci & Technol, Zibo 255000, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Transformers; Computational modeling; Convolutional neural networks; Computer architecture; Task analysis; Image segmentation; Head; Action recognition; self-attention; temporal shift module; vision transformer;
DOI
10.1109/ACCESS.2024.3379885
CLC Classification Number
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
This paper introduces a novel action recognition model named ViT-Shift, which combines the Temporal Shift Module (TSM) with the Vision Transformer (ViT) architecture. Traditional video action recognition models face significant computational challenges and require substantial computing resources; our model addresses this issue by incorporating the TSM, achieving strong performance while significantly reducing computational cost. Our approach applies the Transformer self-attention mechanism to video sequence processing in place of traditional convolutional methods. To preserve the core architecture of ViT and transfer its excellent image recognition performance to video action recognition, we strategically insert the TSM only before the multi-head attention layer of ViT. This design simulates temporal interactions through channel shifts, effectively reducing computational complexity. We carefully design the position and shift parameters of the TSM to maximize the model's performance. Experimental results demonstrate that ViT-Shift achieves remarkable results on two standard action recognition datasets: with ImageNet-21K pretraining, it reaches an accuracy of 77.55% on the Kinetics-400 dataset and 93.07% on the UCF-101 dataset.
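As a rough illustration of the mechanism described in the abstract (a temporal channel shift applied to ViT tokens just before multi-head self-attention), the following PyTorch sketch shows one way such a block could be wired. This is not the authors' released implementation: the token layout (batch*frames, tokens, channels), the shift fraction fold_div=8, and the use of nn.MultiheadAttention are illustrative assumptions.

```python
# Minimal sketch of a "shift before attention" ViT block (assumptions noted above).
import torch
import torch.nn as nn

def temporal_shift(x, num_frames, fold_div=8):
    """Shift a fraction of channels across adjacent frames (zero-padded at the ends)."""
    bt, n, c = x.shape                       # (batch*frames, tokens, channels)
    b = bt // num_frames
    x = x.view(b, num_frames, n, c)
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, 1:, :, :fold] = x[:, :-1, :, :fold]                   # shift forward in time
    out[:, :-1, :, fold:2 * fold] = x[:, 1:, :, fold:2 * fold]   # shift backward in time
    out[:, :, :, 2 * fold:] = x[:, :, :, 2 * fold:]              # remaining channels unchanged
    return out.view(bt, n, c)

class ShiftAttentionBlock(nn.Module):
    """Transformer encoder block with a temporal shift inserted before attention."""
    def __init__(self, dim=768, heads=12, num_frames=8, mlp_ratio=4):
        super().__init__()
        self.num_frames = num_frames
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):
        h = temporal_shift(self.norm1(x), self.num_frames)  # shift channels, then attend
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x

tokens = torch.randn(2 * 8, 197, 768)        # 2 clips x 8 frames, 196 patches + CLS token
print(ShiftAttentionBlock()(tokens).shape)   # torch.Size([16, 197, 768])
```

Because the shift only moves existing channel values between neighbouring frames, it adds essentially no parameters or FLOPs on top of the per-frame ViT block, which is the cost argument the abstract makes.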
Pages: 47246-47257
Number of pages: 12
Related Papers
50 records in total
  • [31] Ye, Qing; Tan, Zexian; Zhang, Yongmei. Human action recognition method based on Motion Excitation and Temporal Aggregation module. HELIYON, 2022, 8 (11).
  • [32] Guo, Yan; Li, Zhuowu; Liu, Fujiang; Lin, Weihua; Liu, Hongchen; Shao, Quansen; Zhang, Dexiong; Liang, Weichao; Su, Junshun; Gao, Qiankai. Fast and lightweight automatic lithology recognition based on efficient vision transformer network. SOLID EARTH SCIENCES, 2025, 10 (01).
  • [33] Liu, Hong; Ren, Bin; Liu, Mengyuan; Ding, Runwei. Grouped Temporal Enhancement Module for Human Action Recognition. 2020 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2020: 1801-1805.
  • [34] Voillemin, Theo; Wannous, Hazem; Vandeborre, Jean-Philippe. 2D Deep Video Capsule Network with Temporal Shift for Action Recognition. 2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021: 3513-3519.
  • [35] Luo, Hui-Lan; Chen, Han; Cheung, Yiu-Ming; Yu, Yawei. Spatial-temporal interaction module for action recognition. JOURNAL OF ELECTRONIC IMAGING, 2022, 31 (04).
  • [36] Lu, Y.-X.; Xu, G.-H.; Tang, B. Worker behavior recognition based on temporal and spatial self-attention of vision Transformer. Zhejiang Daxue Xuebao (Gongxue Ban)/Journal of Zhejiang University (Engineering Science), 2023, 57 (03): 446-454.
  • [37] Shen, Siyuan; Liu, Feng; Wang, Hanyang; Wang, Yunlong; Zhou, Aimin. Temporal Shift Module with Pretrained Representations for Speech Emotion Recognition. INTELLIGENT COMPUTING, 2024, 3.
  • [38] Li, H.; Ding, Y.; Li, C.; Zhang, S. Action Recognition of Temporal Segment Network Based on Feature Fusion. Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2020, 57 (01): 145-158.
  • [39] Koohzadi, Maryam; Charkari, Nasrollah Moghadam. A Context Based Deep Temporal Embedding Network in Action Recognition. NEURAL PROCESSING LETTERS, 2020, 52 (01): 187-220.
  • [40] Koohzadi, Maryam; Charkari, Nasrollah Moghadam. A Context Based Deep Temporal Embedding Network in Action Recognition. NEURAL PROCESSING LETTERS, 2020, 52: 187-220.