Temporal Shift Module-Based Vision Transformer Network for Action Recognition

Cited by: 1
Authors
Zhang, Kunpeng [1]
Lyu, Mengyan [1]
Guo, Xinxin [1]
Zhang, Liye [1]
Liu, Cong [1]
Affiliations
[1] Shandong Univ Technol, Coll Comp Sci & Technol, Zibo 255000, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Computational modeling; Convolutional neural networks; Computer architecture; Task analysis; Image segmentation; Head; Action recognition; self-attention; temporal shift module; vision transformer;
DOI
10.1109/ACCESS.2024.3379885
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
This paper introduces ViT-Shift, an action recognition model that combines the Temporal Shift Module (TSM) with the Vision Transformer (ViT) architecture. Video action recognition is computationally demanding and traditionally requires substantial computing resources. Our model addresses this by incorporating the TSM, achieving strong performance while significantly reducing computational cost. Our approach processes video sequences with the Transformer self-attention mechanism instead of traditional convolutional methods. To preserve the core architecture of ViT and transfer its strong image recognition performance to video action recognition, we introduce the TSM only before the multi-head attention layer of ViT. This design lets us model temporal interactions through channel shifts, which reduces computational complexity. We carefully choose the position and shift parameters of the TSM to maximize the model's performance. With ImageNet-21K pretraining, ViT-Shift achieves an accuracy of 77.55% on the Kinetics-400 dataset and 93.07% on the UCF-101 dataset, two standard action recognition benchmarks.
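The channel shift that the abstract describes can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `temporal_shift`, the `fold_div` parameter, the shift directions, and the zero padding at the clip boundaries are all assumptions, modeled on the commonly used TSM formulation, and the abstract does not state the shift parameters actually used in ViT-Shift. The sketch uses plain nested lists indexed as `[frame][token][channel]` so that it runs without any tensor library.

```python
from typing import List

# Hypothetical layout: frames[t][n][c] is channel c of token n in frame t.
Frames = List[List[List[float]]]


def temporal_shift(frames: Frames, fold_div: int = 8) -> Frames:
    """Shift a fraction of channels along the time axis (assumed TSM-style).

    The first `fold` channels of each token are taken from the previous
    frame, the next `fold` channels from the next frame, and the remaining
    channels are left unchanged. Missing neighbours are zero padded.
    """
    num_frames = len(frames)
    num_channels = len(frames[0][0])
    fold = num_channels // fold_div  # channels shifted in each direction

    shifted: Frames = []
    for t, frame in enumerate(frames):
        new_frame = []
        for n, token in enumerate(frame):
            new_token = list(token)  # copy, so the input is not modified
            for c in range(fold):
                # Channel group 1: value comes from the previous frame.
                new_token[c] = frames[t - 1][n][c] if t > 0 else 0.0
            for c in range(fold, 2 * fold):
                # Channel group 2: value comes from the next frame.
                new_token[c] = frames[t + 1][n][c] if t < num_frames - 1 else 0.0
            new_frame.append(new_token)
        shifted.append(new_frame)
    return shifted


# Toy clip: 3 frames, 1 token per frame, 4 channels; value = 10 * t + c.
frames = [[[float(10 * t + c) for c in range(4)]] for t in range(3)]
shifted = temporal_shift(frames, fold_div=4)
```

Under these assumptions the shift itself is pure data movement with no multiplications or learned parameters, which is consistent with the abstract's claim that temporal interaction is obtained at low computational cost. In a model of this kind, the shifted tokens would then be passed to the per-frame multi-head attention layer.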
Pages: 47246-47257
Number of pages: 12