WLiT: Windows and Linear Transformer for Video Action Recognition

Cited by: 4
Authors
Sun, Ruoxi [1 ,2 ]
Zhang, Tianzhao [1 ,3 ]
Wan, Yong [4 ]
Zhang, Fuping [1 ]
Wei, Jianming [1 ]
Affiliations
[1] Chinese Acad Sci, Shanghai Adv Res Inst, Shanghai 201210, Peoples R China
[2] Shanghai Tech Univ, Sch Informat Sci & Technol, Shanghai 201210, Peoples R China
[3] Univ Chinese Acad Sci, Sch Elect Elect & Commun Engn, Beijing 100049, Peoples R China
[4] Chinese Acad Sci, Inst Rock & Soil Mech, State Key Lab Geomech & Geotech Engn, Wuhan 430071, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
action recognition; Spatial-Windows attention; linear attention; self-attention; transformer;
DOI
10.3390/s23031616
Chinese Library Classification (CLC)
O65 [Analytical Chemistry];
Subject classification codes
070302 ; 081704 ;
Abstract
The emergence of the Transformer has driven rapid progress in video understanding, but it also brings the problem of high computational complexity. Previous methods either divide the feature maps into windows along the spatiotemporal dimensions and compute attention within each window, or down-sample the features during attention computation to reduce their spatiotemporal resolution. Although these approaches effectively reduce complexity, there is still room for further optimization. We therefore present the Windows and Linear Transformer (WLiT) for efficient video action recognition, which combines Spatial-Windows attention with Linear attention. We first divide the feature maps into multiple windows along the spatial dimensions and compute attention separately within each window, further reducing the computational complexity compared with previous methods. However, the receptive field of Spatial-Windows attention is small, so global spatiotemporal information cannot be obtained. To address this problem, we then compute Linear attention along the channel dimension so that the model can capture complete spatiotemporal information. Through this mechanism, our method achieves better recognition accuracy with less computational complexity. We conduct extensive experiments on four public datasets: Something-Something V2 (SSV2), Kinetics400 (K400), UCF101, and HMDB51. On the SSV2 dataset, our method reduces the computational complexity by 28% and improves the recognition accuracy by 1.6% compared with the State-Of-The-Art (SOTA) method. On K400 and the two other datasets, our method achieves SOTA-level accuracy while reducing the complexity by about 49%.
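For concreteness, the PyTorch sketch below illustrates the two attention mechanisms the abstract describes. It is a minimal reconstruction from the abstract alone, not the authors' released code: the class names, the window size, and the softmax-based linear-attention kernel (in the style of efficient/linear attention) are all assumptions.

```python
# Illustrative sketch of the WLiT attention pair, assuming video features
# shaped (B, T, H, W, C). Names and hyperparameters are hypothetical.
import torch
import torch.nn as nn

class SpatialWindowAttention(nn.Module):
    """Self-attention computed independently inside non-overlapping spatial
    windows, so cost scales with the window size, not the frame size."""
    def __init__(self, dim, window_size=7, num_heads=8):
        super().__init__()
        self.ws = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, T, H, W, C); H and W divisible by ws
        B, T, H, W, C = x.shape
        ws = self.ws
        # Partition each frame into (H/ws)*(W/ws) windows of ws*ws tokens.
        x = x.view(B, T, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 2, 4, 3, 5, 6).reshape(-1, ws * ws, C)
        x, _ = self.attn(x, x, x)  # attention within each window only
        # Reverse the partition back to (B, T, H, W, C).
        x = x.view(B, T, H // ws, W // ws, ws, ws, C)
        return x.permute(0, 1, 2, 4, 3, 5, 6).reshape(B, T, H, W, C)

class ChannelLinearAttention(nn.Module):
    """Linear attention along the channel dimension: a C x C context matrix
    aggregates all T*H*W tokens, restoring a global spatiotemporal receptive
    field at O(N*C^2) cost instead of O(N^2*C)."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, T, H, W, C)
        B, T, H, W, C = x.shape
        t = x.reshape(B, -1, C)                 # N = T*H*W tokens
        q, k, v = self.q(t), self.k(t), self.v(t)
        q = q.softmax(dim=-1)                   # normalize over channels
        k = k.softmax(dim=1)                    # normalize over tokens
        ctx = k.transpose(1, 2) @ v             # (B, C, C) global context
        out = q @ ctx                           # (B, N, C), linear in N
        return out.reshape(B, T, H, W, C)
```

Under these assumptions, stacking the two modules keeps the windowed block's cost proportional to the window area per token while the channel-wise block supplies the global spatiotemporal coupling the windows lack, matching the complementary roles described in the abstract.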
Pages: 19
Related Papers
50 records in total
  • [31] Video Action Retrieval Using Action Recognition Model
    Iinuma, Yuko
    Satoh, Shin'ichi
    PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 603 - 606
  • [32] Modeling Video Evolution For Action Recognition
    Fernando, Basura
    Gavves, Efstratios
    Oramas, Jose M.
    Ghodrati, Amir
    Tuytelaars, Tinne
    2015 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2015, : 5378 - 5387
  • [33] Breaking video into pieces for action recognition
    Zheng, Ying
    Yao, Hongxun
    Sun, Xiaoshuai
    Jiang, Xuesong
    Porikli, Fatih
    MULTIMEDIA TOOLS AND APPLICATIONS, 2017, 76 (21) : 22195 - 22212
  • [34] LoViT: Long Video Transformer for surgical phase recognition
    Liu, Yang
    Boels, Maxence
    Garcia-Peraza-Herrera, Luis C.
    Vercauteren, Tom
    Dasgupta, Prokar
    Granados, Alejandro
    Ourselin, Sebastien
    MEDICAL IMAGE ANALYSIS, 2025, 99
  • [35] Second-order transformer network for video recognition
    Zhang, Bingbing
    Dong, Wei
    Wang, Zhenwei
    Zhang, Jianxin
    Sun, Qiule
    ALEXANDRIA ENGINEERING JOURNAL, 2025, 114 : 82 - 94
  • [36] A Method of Simultaneously Action Recognition and Video Segmentation of Video Streams
    Ji, Liang
    Xiong, Rong
    Wang, Yue
    Yu, Hongsheng
    2017 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND BIOMIMETICS (IEEE ROBIO 2017), 2017, : 1515 - 1520
  • [37] Short-Term Action Learning for Video Action Recognition
Liu, Ting-Long
    IEEE Access, 2024, 12 : 30867 - 30875
  • [38] Human action recognition with transformer based on convolutional features
    Shi, Chengcheng
    Liu, Shuxin
    INTELLIGENT DECISION TECHNOLOGIES-NETHERLANDS, 2024, 18 (02): : 881 - 896
  • [39] AGPN: Action Granularity Pyramid Network for Video Action Recognition
    Chen, Yatong
    Ge, Hongwei
    Liu, Yuxuan
    Cai, Xinye
    Sun, Liang
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (08) : 3912 - 3923
  • [40] Live Video Action Recognition from Unsupervised Action Proposals
Lopez-Sastre, Roberto J.
    Baptista-Rios, Marcos
    Acevedo-Rodriguez, Francisco J.
    Martin-Martin, Pilar
    Maldonado-Bascon, Saturnino
    PROCEEDINGS OF 17TH INTERNATIONAL CONFERENCE ON MACHINE VISION APPLICATIONS (MVA 2021), 2021,