WLiT: Windows and Linear Transformer for Video Action Recognition

Cited by: 4

Authors:

Sun, Ruoxi [1,2]

Zhang, Tianzhao [1,3]

Wan, Yong [4]

Zhang, Fuping [1]

Wei, Jianming [1]

Affiliations:

[1] Chinese Acad Sci, Shanghai Adv Res Inst, Shanghai 201210, Peoples R China

[2] Shanghai Tech Univ, Sch Informat Sci & Technol, Shanghai 201210, Peoples R China

[3] Univ Chinese Acad Sci, Sch Elect Elect & Commun Engn, Beijing 100049, Peoples R China

[4] Chinese Acad Sci, Inst Rock & Soil Mech, State Key Lab Geomech & Geotech Engn, Wuhan 430071, Peoples R China

Source:

SENSORS | 2023 / Vol. 23 / Issue 03

Funding:

National Natural Science Foundation of China;

Keywords:

action recognition; Spatial-Windows attention; linear attention; self-attention; transformer;

DOI:

10.3390/s23031616

Chinese Library Classification (CLC) number:

O65 [Analytical Chemistry];

Discipline classification codes:

070302 ; 081704 ;

Abstract:

The emergence of Transformer has led to the rapid development of video understanding, but it also brings the problem of high computational complexity. Previously, there were methods to divide the feature maps into windows along the spatiotemporal dimensions and then calculate the attention. There are also methods to perform down-sampling during attention computation to reduce the spatiotemporal resolution of features. Although the complexity is effectively reduced, there is still room for further optimization. Thus, we present the Windows and Linear Transformer (WLiT) for efficient video action recognition, by combining Spatial-Windows attention with Linear attention. We first divide the feature maps into multiple windows along the spatial dimensions and calculate the attention separately inside the windows. Therefore, our model further reduces the computational complexity compared with previous methods. However, the perceptual field of Spatial-Windows attention is small, and global spatiotemporal information cannot be obtained. To address this problem, we then calculate Linear attention along the channel dimension so that the model can capture complete spatiotemporal information. Our method achieves better recognition accuracy with less computational complexity through this mechanism. We conduct extensive experiments on four public datasets, namely Something-Something V2 (SSV2), Kinetics400 (K400), UCF101, and HMDB51. On the SSV2 dataset, our method reduces the computational complexity by 28% and improves the recognition accuracy by 1.6% compared to the State-Of-The-Art (SOTA) method. On the K400 and two other datasets, our method achieves SOTA-level accuracy while reducing the complexity by about 49%.
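The two mechanisms named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the identity query/key/value projections, the single head, and the particular linear-attention normalization (a softmax over channels for queries and over tokens for keys, combined through a C x C context matrix) are all assumptions chosen for brevity. It only shows why attention restricted to spatial windows is local, and how a linear-complexity attention can restore global spatiotemporal mixing.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_window_attention(x, win):
    """Softmax attention computed separately inside each win x win
    spatial window of every frame.

    x: array of shape (T, H, W, C). Queries, keys and values are the
    tokens themselves (identity projections, an assumption of this sketch).
    """
    T, H, W, C = x.shape
    assert H % win == 0 and W % win == 0, "H and W must be divisible by win"
    nH, nW = H // win, W // win
    # Partition: (T, nH, win, nW, win, C) -> (T*nH*nW, win*win, C)
    w = x.reshape(T, nH, win, nW, win, C)
    w = w.transpose(0, 1, 3, 2, 4, 5).reshape(-1, win * win, C)
    # Attention inside each window; cost grows with (win*win)^2 per window.
    scores = w @ w.transpose(0, 2, 1) / np.sqrt(C)
    out = softmax(scores, axis=-1) @ w
    # Reverse the partition back to (T, H, W, C).
    out = out.reshape(T, nH, nW, win, win, C)
    return out.transpose(0, 1, 3, 2, 4, 5).reshape(T, H, W, C)


def linear_attention(x):
    """Global attention through a C x C context matrix.

    Instead of forming an N x N token-to-token matrix (N = T*H*W),
    keys and values are first aggregated into a (C, C) context, so the
    cost is linear in N. The normalization used here is an assumption.
    """
    T, H, W, C = x.shape
    tokens = x.reshape(-1, C)        # (N, C)
    q = softmax(tokens, axis=-1)     # each query normalized over channels
    k = softmax(tokens, axis=0)      # each key channel normalized over tokens
    context = k.T @ tokens           # (C, C), summarizes all tokens
    return (q @ context).reshape(T, H, W, C)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((2, 4, 4, 8))   # (T, H, W, C)
    local = spatial_window_attention(feats, win=2)
    mixed = linear_attention(local)
    print(local.shape, mixed.shape)
```

In this sketch, perturbing one token changes the window-attention output only inside that token's window, whereas the linear-attention output changes at every spatiotemporal position, which mirrors the abstract's motivation for combining the two.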

Pages: 19
