STAN: Spatio-Temporal Analysis Network for efficient video action recognition

Cited by: 0
Authors
Chen, Shilin [1 ,2 ]
Wang, Xingwang [1 ,3 ,4 ]
Sun, Yafeng [1 ]
Yan, Kun [5 ,6 ,7 ,8 ,9 ]
Affiliations
[1] Jilin Univ, Coll Comp Sci & Technol, Changchun 130012, Peoples R China
[2] Jilin Univ, Coll Software Engn, Changchun, Peoples R China
[3] Jilin Univ, Key Lab Symbol Computat Knowledge Engn, Minist Educ, Changchun 130012, Peoples R China
[4] Jilin Univ, Engn, Changchun, Peoples R China
[5] Univ Essex, Sch Comp Sci & Elect Engn, Colchester CO4 3SQ, England
[6] Acad Europaea, MAE, Colchester, England
[7] IET Inst Engn & Technol, London, England
[8] BCS British Comp Soc, London, England
[9] ACM, London, England
Keywords
Video understanding; Action recognition; Attention mechanism; Model optimization; Network construction
DOI
10.1016/j.eswa.2024.126255
CLC number
TP18 [Theory of artificial intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Action recognition, the task of identifying and extracting spatio-temporal features from video content, is a foundation of video understanding. However, current methods for sampling these features are computationally intensive and require complex model architectures. In this paper, we improve 2D CNNs for action recognition, keeping the model streamlined while making effective and complementary use of the spatio-temporal features of videos. We propose the Spatio-Temporal Analysis Network (STAN), which strikes a balance between model complexity and recognition accuracy. It contains two key components acting on spatio-temporal features: the Temporal Embedding Head (TEH) and Spatio-Temporal Attention (STA). TEH introduces a differential analysis of actions, allowing the model to capture subtle temporal changes and enhancing its representational capacity. STA offers a novel perspective on video streams, improving the spatio-temporal representation without significantly increasing computational demands; it achieves this through a stylized spatial analysis of the features that differs from conventional optical-flow and depth-map methods. Results on four datasets demonstrate our method's efficiency and accuracy. Compared to 3D CNNs, our method improves action recognition accuracy by 1.2% and reduces computational cost by 30%. Using only 20% of the UCF101 dataset, our model achieves an accuracy of 91.68%.
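The abstract describes TEH as a differential analysis of actions that captures subtle temporal changes, but the paper's exact formulation is not reproduced in this record. As a rough illustration of the general idea only — frame-wise temporal differencing fused with the original appearance features — a minimal NumPy sketch might look like the following (the function name, shapes, and fusion-by-concatenation choice are all assumptions, not the authors' implementation):

```python
import numpy as np

def temporal_embedding_head(frames: np.ndarray) -> np.ndarray:
    """Hypothetical sketch of a temporal-difference embedding.

    frames: array of shape (T, C, H, W) -- T video frames.
    Returns features of shape (T, 2*C, H, W): each frame is
    concatenated with its forward temporal difference.
    """
    # Forward difference frames[t+1] - frames[t]; zero-pad the last step
    diff = np.diff(frames, axis=0)                               # (T-1, C, H, W)
    diff = np.concatenate([diff, np.zeros_like(frames[:1])], axis=0)
    # Fuse appearance (raw frame) and motion cue (difference) channels
    return np.concatenate([frames, diff], axis=1)

# Example: 8 RGB frames of size 32x32 -> features of shape (8, 6, 32, 32)
feats = temporal_embedding_head(np.random.rand(8, 3, 32, 32))
print(feats.shape)
```

A real TEH would presumably operate on learned feature maps inside a 2D CNN rather than raw pixels; the sketch only shows how a differential signal can be extracted without 3D convolutions or optical flow.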
Pages: 15
Related papers
50 records in total
  • [1] Efficient spatio-temporal network for action recognition
    Su, Yanxiong
    Zhao, Qian
    JOURNAL OF REAL-TIME IMAGE PROCESSING, 2024, 21 (05)
  • [2] Spatio-temporal Video Autoencoder for Human Action Recognition
    Sousa e Santos, Anderson Carlos
    Pedrini, Helio
    PROCEEDINGS OF THE 14TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER VISION, IMAGING AND COMPUTER GRAPHICS THEORY AND APPLICATIONS (VISAPP), VOL 5, 2019: 114 - 123
  • [3] Exploiting spatio-temporal knowledge for video action recognition
    Zhang, Huigang
    Wang, Liuan
    Sun, Jun
    IET COMPUTER VISION, 2023, 17 (02): 222 - 230
  • [4] Interpretable Spatio-temporal Attention for Video Action Recognition
    Meng, Lili
    Zhao, Bo
    Chang, Bo
    Huang, Gao
    Sun, Wei
    Tung, Frederich
    Sigal, Leonid
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2019: 1513 - 1522
  • [5] STAN: Spatio-Temporal Alignment Network for No-Reference Video Quality Assessment
    Yang, Zhengyi
    Dang, Yuanjie
    Xiang, Jianjun
    Chen, Peng
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT III, 2023, 14256: 160 - 171
  • [6] Spatio-Temporal Collaborative Module for Efficient Action Recognition
    Hao, Yanbin
    Wang, Shuo
    Tan, Yi
    He, Xiangnan
    Liu, Zhenguang
    Wang, Meng
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31: 7279 - 7291
  • [7] STHARNet: spatio-temporal human action recognition network in content based video retrieval
    Sowmyayani, S.
    Rani, P. Arockia Jansi
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (24): 38051 - 38066
  • [8] SiamMAST: Siamese motion-aware spatio-temporal network for video action recognition
    Lu, Xuemin
    Quan, Wei
    Reformat, Marek
    Zhao, Haiquan
    Chen, Jim X.
    VISUAL COMPUTER, 2024, 40 (05): 3163 - 3181