WLiT: Windows and Linear Transformer for Video Action Recognition

Cited by: 4
Authors
Sun, Ruoxi [1 ,2 ]
Zhang, Tianzhao [1 ,3 ]
Wan, Yong [4 ]
Zhang, Fuping [1 ]
Wei, Jianming [1 ]
Affiliations
[1] Chinese Acad Sci, Shanghai Adv Res Inst, Shanghai 201210, Peoples R China
[2] Shanghai Tech Univ, Sch Informat Sci & Technol, Shanghai 201210, Peoples R China
[3] Univ Chinese Acad Sci, Sch Elect Elect & Commun Engn, Beijing 100049, Peoples R China
[4] Chinese Acad Sci, Inst Rock & Soil Mech, State Key Lab Geomech & Geotech Engn, Wuhan 430071, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
action recognition; Spatial-Windows attention; linear attention; self-attention; transformer;
DOI
10.3390/s23031616
Chinese Library Classification (CLC)
O65 [Analytical Chemistry];
Discipline codes
070302; 081704;
Abstract
The emergence of the Transformer has driven rapid progress in video understanding, but it also brings high computational complexity. Prior methods divide the feature maps into windows along the spatiotemporal dimensions and compute attention within each window; others down-sample during attention computation to reduce the spatiotemporal resolution of the features. Although these approaches effectively reduce complexity, there is still room for further optimization. We therefore present the Windows and Linear Transformer (WLiT) for efficient video action recognition, which combines Spatial-Windows attention with Linear attention. We first divide the feature maps into multiple windows along the spatial dimensions and compute attention separately inside each window, further reducing computational complexity compared with previous methods. However, the receptive field of Spatial-Windows attention is small, so global spatiotemporal information cannot be obtained. To address this, we additionally compute Linear attention along the channel dimension, allowing the model to capture complete spatiotemporal information. Through this mechanism, our method achieves better recognition accuracy with less computational complexity. We conduct extensive experiments on four public datasets: Something-Something V2 (SSV2), Kinetics400 (K400), UCF101, and HMDB51. On SSV2, our method reduces computational complexity by 28% and improves recognition accuracy by 1.6% compared to the State-Of-The-Art (SOTA) method. On K400 and the two other datasets, our method achieves SOTA-level accuracy while reducing complexity by about 49%.
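The two attention mechanisms described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration under assumptions of our own: the function names, the single-head formulation, and the scaling factors are illustrative only, and the paper's actual WLiT blocks additionally use learned Q/K/V projections, multiple heads, and normalization, none of which are shown here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_window_attention(x, win=2):
    """Attention computed independently inside non-overlapping spatial
    windows of a (T, H, W, C) feature map. Each window holds win*win
    tokens, so the cost per frame is O((win*win)^2 * C) per window
    rather than O((H*W)^2 * C) for global spatial attention.
    Assumes H and W are divisible by win."""
    T, H, W, C = x.shape
    out = np.empty_like(x)
    for t in range(T):
        for i in range(0, H, win):
            for j in range(0, W, win):
                tokens = x[t, i:i + win, j:j + win].reshape(-1, C)  # (win*win, C)
                attn = softmax(tokens @ tokens.T / np.sqrt(C))      # token-token map
                out[t, i:i + win, j:j + win] = (attn @ tokens).reshape(win, win, C)
    return out

def channel_linear_attention(x):
    """Attention along the channel dimension: a C x C map built from all
    T*H*W tokens, so cost grows linearly with the number of spatiotemporal
    tokens while every token mixes through shared global statistics."""
    T, H, W, C = x.shape
    tokens = x.reshape(-1, C)                                        # (N, C)
    attn = softmax(tokens.T @ tokens / np.sqrt(tokens.shape[0]))     # (C, C)
    return (tokens @ attn).reshape(T, H, W, C)

# Toy usage: windowed spatial attention followed by channel attention
# preserves the feature-map shape.
x = np.random.default_rng(0).standard_normal((2, 4, 4, 8))
y = channel_linear_attention(spatial_window_attention(x, win=2))
```

The key contrast is where the quadratic term lives: the windowed step is quadratic only in the (small, fixed) window size, while the channel step is quadratic only in C, so neither scales quadratically with the full spatiotemporal token count.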
Pages: 19
Related Papers
50 records total
  • [21] Action recognition on continuous video
    Chang, Y. L.
    Chan, C. S.
    Remagnino, P.
    Neural Computing and Applications, 2021, 33 : 1233 - 1243
  • [22] A hierarchical Transformer network for smoke video recognition
    Cheng, Guangtao
    Xian, Baoyi
    Liu, Yifan
    Chen, Xue
    Hu, Lianjun
    Song, Zhanjie
    DIGITAL SIGNAL PROCESSING, 2025, 158
  • [23] TubeR: Tubelet Transformer for Video Action Detection
    Zhao, Jiaojiao
    Zhang, Yanyi
    Li, Xinyu
    Chen, Hao
    Shuai, Bing
    Xu, Mingze
    Liu, Chunhui
    Kundu, Kaustav
    Xiong, Yuanjun
    Modolo, Davide
    Marsic, Ivan
    Snoek, Cees G. M.
    Tighe, Joseph
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 13588 - 13597
  • [24] Automatic excavator action recognition and localisation for untrimmed video using hybrid LSTM-Transformer networks
    Martin, Abbey
    Hill, Andrew J.
    Seiler, Konstantin M.
    Balamurali, Mehala
    INTERNATIONAL JOURNAL OF MINING RECLAMATION AND ENVIRONMENT, 2024, 38 (05) : 353 - 372
  • [25] Transformer-based deep learning model and video dataset for installation action recognition in offsite projects
    Jang, Junyoung
    Jeong, Eunbeen
    Kim, Tae Wan
    AUTOMATION IN CONSTRUCTION, 2025, 172
  • [26] STAR++: Rethinking spatio-temporal cross attention transformer for video action recognition
    Ahn, Dasom
    Kim, Sangwon
    Ko, Byoung Chul
    APPLIED INTELLIGENCE, 2023, 53 (23) : 28446 - 28459
  • [27] A Graph Skeleton Transformer Network for Action Recognition
    Jiang, Yujian
    Sun, Zhaoneng
    Yu, Saisai
    Wang, Shuang
    Song, Yang
    SYMMETRY-BASEL, 2022, 14 (08)
  • [28] Coupling Video Segmentation and Action Recognition
    Ghodrati, Amir
    Pedersoli, Marco
    Tuytelaars, Tinne
    2014 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2014, : 618 - 625
  • [29] Breaking video into pieces for action recognition
    Zheng, Ying
    Yao, Hongxun
    Sun, Xiaoshuai
    Jiang, Xuesong
    Porikli, Fatih
    Multimedia Tools and Applications, 2017, 76 : 22195 - 22212
  • [30] Action recognition in broadcast tennis video
    Zhu, Guangyu
    Xu, Changsheng
    Huang, Qingming
    Gao, Wen
    18TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL 1, PROCEEDINGS, 2006, : 251+