Mixed Attention and Channel Shift Transformer for Efficient Action Recognition

Cited: 0
Authors
Lu, Xiusheng [1 ]
Hao, Yanbin [2 ]
Cheng, Lechao [3 ]
Zhao, Sicheng [1 ]
Li, Yutao [4 ]
Song, Mingli [5 ]
Institutions
[1] Tsinghua Univ, Beijing, Peoples R China
[2] Univ Sci & Technol China, Hefei, Peoples R China
[3] Hefei Univ Technol, Hefei, Peoples R China
[4] Ocean Univ China, Qingdao, Peoples R China
[5] Zhejiang Univ, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
Action recognition; mixed attention; random attention; channel shift;
DOI
10.1145/3712594
Chinese Library Classification (CLC)
TP [Automation technology, computer technology];
Discipline Classification Code
0812;
Abstract
The practical use of Transformer-based methods for video processing is constrained by their high computational complexity. Although previous approaches adopt a spatiotemporal decomposition of 3D attention to mitigate this issue, they suffer from the drawback of neglecting the majority of visual tokens. This article presents a novel mixed attention operation that subtly fuses random, spatial, and temporal attention mechanisms. The proposed random attention stochastically samples video tokens in a simple yet effective way, complementing the other attention methods. Furthermore, since the attention operation concentrates on learning long-distance relationships, we employ a channel shift operation to encode short-term temporal characteristics. By combining these techniques, our model provides more comprehensive motion representations. Experimental results show that the proposed method achieves competitive action recognition accuracy with low computational overhead on both large-scale and small-scale public video datasets.
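To make the described operations concrete, below is a minimal PyTorch sketch of the two ingredients named in the abstract: a random attention in which every query attends to a stochastically sampled subset of video tokens, and a channel shift that moves a fraction of channels between neighboring frames to encode short-term motion. The tensor layout, sampling ratio, shift fraction, and function names are illustrative assumptions, not the authors' implementation.

    import torch

    def random_attention(x, sample_ratio=0.25):
        # x: (batch, num_tokens, dim) video tokens flattened over space and time.
        b, n, d = x.shape
        k = max(1, int(n * sample_ratio))
        # Stochastically sample k tokens per clip to act as keys/values.
        idx = torch.rand(b, n, device=x.device).argsort(dim=1)[:, :k]
        kv = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, d))
        # Each query attends only to the sampled subset: O(n*k) instead of O(n^2).
        attn = torch.softmax(x @ kv.transpose(1, 2) / d ** 0.5, dim=-1)
        return attn @ kv

    def channel_shift(x, num_frames, shift_div=8):
        # x: (batch, num_frames * tokens_per_frame, dim), num_tokens divisible
        # by num_frames. Shift 1/shift_div of the channels one frame forward and
        # another 1/shift_div one frame backward (TSM-style), leaving the rest
        # untouched, so each frame's tokens mix in short-term temporal context.
        b, n, d = x.shape
        x = x.view(b, num_frames, n // num_frames, d)
        fold = d // shift_div
        out = torch.zeros_like(x)
        out[:, 1:, :, :fold] = x[:, :-1, :, :fold]                  # forward shift
        out[:, :-1, :, fold:2 * fold] = x[:, 1:, :, fold:2 * fold]  # backward shift
        out[:, :, :, 2 * fold:] = x[:, :, :, 2 * fold:]             # unshifted channels
        return out.view(b, n, d)

    # Example: 8 frames of 14x14 patch tokens with 192 channels.
    tokens = torch.randn(2, 8 * 196, 192)
    out = random_attention(channel_shift(tokens, num_frames=8))

In a full model these operations would be combined with ordinary spatial and temporal attention inside each Transformer block; the sketch only illustrates how random sampling and channel shifting keep the cost of modeling long- and short-range dependencies low.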
Pages: 20
Related Papers
50 items in total
  • [41] Action Transformer: A self-attention model for short-time pose-based human action recognition. Mazzia, Vittorio; Angarano, Simone; Salvetti, Francesco; Angelini, Federico; Chiaberge, Marcello. PATTERN RECOGNITION, 2022, 124.
  • [42] An efficient self-attention network for skeleton-based action recognition. Qin, Xiaofei; Cai, Rui; Yu, Jiabin; He, Changxiang; Zhang, Xuedian. SCIENTIFIC REPORTS, 2022, 12 (1).
  • [44] Image super-resolution reconstruction using Swin Transformer with efficient channel attention networks. Sun, Zhenxi; Zhang, Jin; Chen, Ziyi; Hong, Lu; Zhang, Rui; Li, Weishi; Xia, Haojie. ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2024, 136.
  • [45] Deformable patch embedding-based shift module-enhanced transformer for panoramic action recognition. Zhang, Xiaoyan; Cui, Yujie; Huo, Yongkai. VISUAL COMPUTER, 2023, 39 (08): 3247-3257.
  • [46] Multi-level channel attention excitation network for human action recognition in videos. Wu, Hanbo; Ma, Xin; Li, Yibin. SIGNAL PROCESSING-IMAGE COMMUNICATION, 2023, 114.
  • [47] An efficient mixed attention module. Sheng, Kuang; Chen, Pinghua. IET COMPUTER VISION, 2023, 17 (04): 496-507.
  • [48] Action Recognition Based on Multi-Level Topological Channel Attention of Human Skeleton. Hu, Kai; Shen, Chaowen; Wang, Tianyan; Shen, Shuai; Cai, Chengxue; Huang, Huaming; Xia, Min. SENSORS, 2023, 23 (24).
  • [50] ResT: An Efficient Transformer for Visual Recognition. Zhang, Qing-Long; Yang, Yu-Bin. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34.