WLiT: Windows and Linear Transformer for Video Action Recognition

Cited by: 4
Authors
Sun, Ruoxi [1 ,2 ]
Zhang, Tianzhao [1 ,3 ]
Wan, Yong [4 ]
Zhang, Fuping [1 ]
Wei, Jianming [1 ]
Affiliations
[1] Chinese Acad Sci, Shanghai Adv Res Inst, Shanghai 201210, Peoples R China
[2] Shanghai Tech Univ, Sch Informat Sci & Technol, Shanghai 201210, Peoples R China
[3] Univ Chinese Acad Sci, Sch Elect Elect & Commun Engn, Beijing 100049, Peoples R China
[4] Chinese Acad Sci, Inst Rock & Soil Mech, State Key Lab Geomech & Geotech Engn, Wuhan 430071, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
action recognition; Spatial-Windows attention; linear attention; self-attention; transformer;
DOI
10.3390/s23031616
Chinese Library Classification
O65 [Analytical Chemistry];
Discipline Classification Codes
070302; 081704;
Abstract
The emergence of the Transformer has driven rapid progress in video understanding, but it also brings high computational complexity. Prior methods either divide the feature maps into windows along the spatiotemporal dimensions before computing attention, or down-sample during attention computation to reduce the spatiotemporal resolution of the features. Although these approaches effectively reduce the complexity, there is still room for further optimization. We therefore present the Windows and Linear Transformer (WLiT) for efficient video action recognition, which combines Spatial-Windows attention with Linear attention. We first divide the feature maps into multiple windows along the spatial dimensions and compute attention separately inside each window, which further reduces the computational complexity compared with previous methods. However, the receptive field of Spatial-Windows attention is small, so global spatiotemporal information cannot be captured. To address this, we also compute Linear attention along the channel dimension so that the model can capture complete spatiotemporal information. Through this mechanism, our method achieves better recognition accuracy with less computational complexity. We conduct extensive experiments on four public datasets: Something-Something V2 (SSV2), Kinetics400 (K400), UCF101, and HMDB51. On SSV2, our method reduces the computational complexity by 28% and improves the recognition accuracy by 1.6% compared to the State-Of-The-Art (SOTA) method. On K400 and the two other datasets, our method achieves SOTA-level accuracy while reducing the complexity by about 49%.
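The two attention mechanisms described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; all function names, shapes, and the identity Q/K/V projections are illustrative assumptions. The first function restricts standard softmax attention to non-overlapping spatial windows (quadratic only in the small window size), and the second computes a softmax-free, channel-wise linear attention whose cost is linear in the number of tokens.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(x, win):
    # x: (H, W, C) feature map -> (num_windows, win*win, C) token groups
    H, W, C = x.shape
    x = x.reshape(H // win, win, W // win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, C)

def windows_attention(x, win):
    # Standard attention restricted to each spatial window:
    # cost O(num_windows * (win^2)^2 * C) instead of O((H*W)^2 * C).
    tokens = window_partition(x, win)            # (nW, N, C), N = win*win
    q, k, v = tokens, tokens, tokens             # identity projections for brevity
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1]))
    return attn @ v                              # (nW, N, C)

def linear_attention(x):
    # Channel-wise linear attention: form the (C, C) context K^T V first,
    # so the cost is linear in the token count N = H*W and quadratic in C,
    # while every output token sees a global summary of all tokens.
    tokens = x.reshape(-1, x.shape[-1])          # (N, C)
    q = softmax(tokens, axis=-1)                 # normalize over channels
    k = softmax(tokens, axis=0)                  # normalize over tokens
    context = k.T @ tokens                       # (C, C) global summary
    return q @ context                           # (N, C)

x = np.random.rand(8, 8, 16)                     # toy (H, W, C) feature map
print(windows_attention(x, 4).shape)             # (4, 16, 16): 4 windows of 16 tokens
print(linear_attention(x).shape)                 # (64, 16): all 64 tokens, globally mixed
```

Computing the `(C, C)` context before applying the queries is the standard trick behind linear attention: the associativity of matrix multiplication removes the `N x N` attention map, which is why pairing it with cheap windowed attention can recover global information at low cost.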
Pages: 19
Related Papers
50 records
  • [1] Recurring the Transformer for Video Action Recognition
    Yang, Jiewen
    Dong, Xingbo
    Liu, Liujun
    Zhang, Chao
    Shen, Jiajun
    Yu, Dahai
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 14043 - 14053
  • [2] Sparse Dense Transformer Network for Video Action Recognition
    Qu, Xiaochun
    Zhang, Zheyuan
    Xiao, Wei
    Ran, Jinye
    Wang, Guodong
    Zhang, Zili
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, PT II, 2022, 13369 : 43 - 56
  • [3] TWO-PATHWAY TRANSFORMER NETWORK FOR VIDEO ACTION RECOGNITION
    Jiang, Bo
    Yu, Jiahong
    Zhou, Lei
    Wu, Kailin
    Yang, Yang
    2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021, : 1089 - 1093
  • [4] SVFormer: Semi-supervised Video Transformer for Action Recognition
    Xing, Zhen
    Dai, Qi
    Hu, Han
    Chen, Jingjing
    Wu, Zuxuan
    Jiang, Yu-Gang
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 18816 - 18826
  • [5] FSformer: Fast-Slow Transformer for video action recognition
    Li, Shibao
    Wang, Zhaoyu
    Liu, Yixuan
    Zhang, Yunwu
    Zhu, Jinze
    Cui, Xuerong
    Liu, Jianhang
    IMAGE AND VISION COMPUTING, 2023, 137
  • [6] Temporal Shift Vision Transformer Adapter for Efficient Video Action Recognition
    Shi, Yaning
    Sun, Pu
    Gu, Bing
    Li, Longfei
    PROCEEDINGS OF 2024 4TH INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND INTELLIGENT COMPUTING, BIC 2024, 2024, : 42 - 46
  • [7] MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition
    Chen, Jiawei
    Ho, Chiu Man
    2022 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2022), 2022, : 786 - 797
  • [8] Video Action Transformer Network
    Girdhar, Rohit
    Carreira, Joao
    Doersch, Carl
    Zisserman, Andrew
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 244 - 253
  • [9] An efficient video transformer network with token discard and keyframe enhancement for action recognition
    Zhang, Qian
    Yang, Zuosui
    Shao, Mingwen
    Liang, Hong
JOURNAL OF SUPERCOMPUTING, 2025, 81 (02)
  • [10] Video action recognition collaborative learning with dynamics via PSO-ConvNet Transformer
    Huu Phong Nguyen
    Ribeiro, Bernardete
    SCIENTIFIC REPORTS, 2023, 13 (01)