Mixed Attention and Channel Shift Transformer for Efficient Action Recognition

Cited: 0
Authors
Lu, Xiusheng [1 ]
Hao, Yanbin [2 ]
Cheng, Lechao [3 ]
Zhao, Sicheng [1 ]
Li, Yutao [4 ]
Song, Mingli [5 ]
Institutions
[1] Tsinghua Univ, Beijing, Peoples R China
[2] Univ Sci & Technol China, Hefei, Peoples R China
[3] Hefei Univ Technol, Hefei, Peoples R China
[4] Ocean Univ China, Qingdao, Peoples R China
[5] Zhejiang Univ, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
Action recognition; mixed attention; random attention; channel shift;
DOI
10.1145/3712594
Chinese Library Classification (CLC)
TP [Automation technology, computer technology];
Discipline Classification Code
0812;
Abstract
The practical use of Transformer-based methods for video processing is constrained by their high computational complexity. Although previous approaches adopt a spatiotemporal decomposition of 3D attention to mitigate this issue, they suffer from the drawback of neglecting the majority of visual tokens. This article presents a novel mixed attention operation that subtly fuses random, spatial, and temporal attention mechanisms. The proposed random attention stochastically samples video tokens in a simple yet effective way, complementing the other attention methods. Furthermore, since the attention operation concentrates on learning long-distance relationships, we employ a channel shift operation to encode short-term temporal characteristics. By combining these techniques, our model provides more comprehensive motion representations. Experimental results show that the proposed method achieves competitive action recognition accuracy with low computational overhead on both large-scale and small-scale public video datasets.
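To make the described operations concrete, below is a minimal PyTorch sketch of the two ingredients named in the abstract: a random attention in which every query attends to a stochastically sampled subset of video tokens, and a channel shift that moves a fraction of channels between neighboring frames to encode short-term motion. The tensor layout, sampling ratio, shift fraction, and function names are illustrative assumptions, not the authors' implementation.

    import torch

    def random_attention(x, sample_ratio=0.25):
        # x: (batch, num_tokens, dim) video tokens flattened over space and time.
        b, n, d = x.shape
        k = max(1, int(n * sample_ratio))
        # Stochastically sample k tokens per clip to act as keys/values.
        idx = torch.rand(b, n, device=x.device).argsort(dim=1)[:, :k]
        kv = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, d))
        # Each query attends only to the sampled subset: O(n*k) instead of O(n^2).
        attn = torch.softmax(x @ kv.transpose(1, 2) / d ** 0.5, dim=-1)
        return attn @ kv

    def channel_shift(x, num_frames, shift_div=8):
        # x: (batch, num_frames * tokens_per_frame, dim), num_tokens divisible
        # by num_frames. Shift 1/shift_div of the channels one frame forward and
        # another 1/shift_div one frame backward (TSM-style), leaving the rest
        # untouched, so each frame's tokens mix in short-term temporal context.
        b, n, d = x.shape
        x = x.view(b, num_frames, n // num_frames, d)
        fold = d // shift_div
        out = torch.zeros_like(x)
        out[:, 1:, :, :fold] = x[:, :-1, :, :fold]                  # forward shift
        out[:, :-1, :, fold:2 * fold] = x[:, 1:, :, fold:2 * fold]  # backward shift
        out[:, :, :, 2 * fold:] = x[:, :, :, 2 * fold:]             # unshifted channels
        return out.view(b, n, d)

    # Example: 8 frames of 14x14 patch tokens with 192 channels.
    tokens = torch.randn(2, 8 * 196, 192)
    out = random_attention(channel_shift(tokens, num_frames=8))

In a full model these operations would be combined with ordinary spatial and temporal attention inside each Transformer block; the sketch only illustrates how random sampling and channel shifting keep the cost of modeling long- and short-range dependencies low.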
Pages: 20
Related Papers
50 items in total
  • [41] Action Transformer: A self-attention model for short-time pose-based human action recognition. Mazzia, Vittorio; Angarano, Simone; Salvetti, Francesco; Angelini, Federico; Chiaberge, Marcello. PATTERN RECOGNITION, 2022, 124.
  • [42] An efficient self-attention network for skeleton-based action recognition. Qin, Xiaofei; Cai, Rui; Yu, Jiabin; He, Changxiang; Zhang, Xuedian. SCIENTIFIC REPORTS, 2022, 12 (1).
  • [44] Image super-resolution reconstruction using Swin Transformer with efficient channel attention networks. Sun, Zhenxi; Zhang, Jin; Chen, Ziyi; Hong, Lu; Zhang, Rui; Li, Weishi; Xia, Haojie. ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2024, 136.
  • [45] Deformable patch embedding-based shift module-enhanced transformer for panoramic action recognition. Zhang, Xiaoyan; Cui, Yujie; Huo, Yongkai. VISUAL COMPUTER, 2023, 39 (08): 3247-3257.
  • [46] Multi-level channel attention excitation network for human action recognition in videos. Wu, Hanbo; Ma, Xin; Li, Yibin. SIGNAL PROCESSING-IMAGE COMMUNICATION, 2023, 114.
  • [47] An efficient mixed attention module. Sheng, Kuang; Chen, Pinghua. IET COMPUTER VISION, 2023, 17 (04): 496-507.
  • [48] Action Recognition Based on Multi-Level Topological Channel Attention of Human Skeleton. Hu, Kai; Shen, Chaowen; Wang, Tianyan; Shen, Shuai; Cai, Chengxue; Huang, Huaming; Xia, Min. SENSORS, 2023, 23 (24).
  • [50] ResT: An Efficient Transformer for Visual Recognition. Zhang, Qing-Long; Yang, Yu-Bin. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34.