Mixed Attention and Channel Shift Transformer for Efficient Action Recognition

Times Cited: 0
Authors
Lu, Xiusheng [1 ]
Hao, Yanbin [2 ]
Cheng, Lechao [3 ]
Zhao, Sicheng [1 ]
Li, Yutao [4 ]
Song, Mingli [5 ]
Affiliations
[1] Tsinghua Univ, Beijing, Peoples R China
[2] Univ Sci & Technol China, Hefei, Peoples R China
[3] Hefei Univ Technol, Hefei, Peoples R China
[4] Ocean Univ China, Qingdao, Peoples R China
[5] Zhejiang Univ, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
Action recognition; mixed attention; random attention; channel shift;
DOI
10.1145/3712594
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
The practical use of Transformer-based methods for video processing is constrained by their high computational complexity. Although previous approaches adopt a spatiotemporal decomposition of 3D attention to mitigate this issue, they neglect the majority of visual tokens. This article presents a novel mixed attention operation that subtly fuses random, spatial, and temporal attention mechanisms. The proposed random attention stochastically samples video tokens in a simple yet effective way, complementing the other attention methods. Furthermore, since the attention operation concentrates on learning long-distance relationships, we employ a channel shift operation to encode short-term temporal characteristics. The amalgamation of these techniques allows our model to provide more comprehensive motion representations. Experimental results show that the proposed method achieves competitive action recognition accuracy with low computational overhead on both large-scale and small-scale public video datasets.
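The abstract names two lightweight ingredients: stochastic sampling of video tokens for a random-attention branch and a channel shift that mixes features across neighboring frames. The sketch below illustrates both ideas in PyTorch; it is a minimal illustration under assumed conventions (a batch-time-token-channel tensor layout, a TSM-style zero-padded shift, and the shift and sampling ratios shown), not the authors' implementation.

```python
# Minimal sketch (assumptions, not the paper's code): a TSM-style channel shift
# for short-term temporal modeling, and random token sampling that could feed a
# random-attention branch alongside spatial and temporal attention.
import torch


def channel_shift(x: torch.Tensor, shift_ratio: float = 0.25) -> torch.Tensor:
    """Shift a fraction of channels forward/backward along the time axis.

    x: (batch, time, tokens, channels). Shifted-out positions are zero-padded.
    """
    b, t, n, c = x.shape
    fold = int(c * shift_ratio) // 2
    out = torch.zeros_like(x)
    out[:, 1:, :, :fold] = x[:, :-1, :, :fold]                   # shift forward in time
    out[:, :-1, :, fold:2 * fold] = x[:, 1:, :, fold:2 * fold]   # shift backward in time
    out[:, :, :, 2 * fold:] = x[:, :, :, 2 * fold:]              # remaining channels unchanged
    return out


def sample_random_tokens(x: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Stochastically keep a subset of spatiotemporal tokens per clip."""
    b, t, n, c = x.shape
    tokens = x.reshape(b, t * n, c)
    k = max(1, int(t * n * keep_ratio))
    idx = torch.rand(b, t * n, device=x.device).argsort(dim=1)[:, :k]  # random subset per sample
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, c))


# Example: 2 clips, 8 frames, 196 patch tokens, 768-d features.
feats = torch.randn(2, 8, 196, 768)
shifted = channel_shift(feats)             # short-term temporal mixing, no learnable parameters
subset = sample_random_tokens(feats, 0.5)  # tokens for a random-attention branch
print(shifted.shape, subset.shape)         # (2, 8, 196, 768) and (2, 784, 768)
```

In a full model, the randomly sampled tokens would be attended over jointly with the outputs of spatial and temporal attention, while the channel shift contributes temporal context at essentially no extra compute beyond memory movement; both choices here are illustrative of the ideas described in the abstract rather than the exact architecture.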
Pages: 20