STAN: Spatio-Temporal Analysis Network for efficient video action recognition

Cited: 0
Authors
Chen, Shilin [1 ,2 ]
Wang, Xingwang [1 ,3 ,4 ]
Sun, Yafeng [1 ]
Yan, Kun [5 ,6 ,7 ,8 ,9 ]
Affiliations
[1] Jilin Univ, Coll Comp Sci & Technol, Changchun 130012, Peoples R China
[2] Jilin Univ, Coll Software Engn, Changchun, Peoples R China
[3] Jilin Univ, Key Lab Symbol Computat Knowledge Engn, Minist Educ, Changchun 130012, Peoples R China
[4] Jilin Univ, Engn, Changchun, Peoples R China
[5] Univ Essex, Sch Comp Sci & Elect Engn, Colchester CO4 3SQ, England
[6] Acad Europaea, MAE, Colchester, England
[7] IET Inst Engn & Technol, London, England
[8] BCS British Comp Soc, London, England
[9] ACM, London, England
Keywords
Video understanding; Action recognition; Attention mechanism; Model optimization; Network construction;
DOI
10.1016/j.eswa.2024.126255
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Action recognition, whose goal is identifying and extracting spatio-temporal features from video content, is foundational to video understanding. However, current methods for sampling these features are computationally intensive and require complex model architectures. In this paper, we improve 2D CNNs for action recognition tasks, keeping the model streamlined while making effective and complementary use of the spatio-temporal features of videos. We propose a model, the Spatio-Temporal Analysis Network (STAN), which strikes a balance between model complexity and recognition accuracy. It contains two key components that operate on spatio-temporal features: the Temporal Embedding Head (TEH) and Spatio-Temporal Attention (STA). TEH introduces a differential analysis of actions, allowing the model to capture subtle temporal changes and enhancing its representational capabilities. STA offers a novel perspective on video streams, improving the spatio-temporal representation without significantly increasing computational demands. It achieves this through a stylized spatial analysis of the features that differs from conventional optical-flow and depth-map methods. Results on four datasets demonstrate the remarkable efficiency and accuracy of our methodology. Compared to 3D CNNs, our method improves action recognition accuracy by 1.2% and reduces computational costs by 30%. With a dataset utilization rate of only 20% of UCF101, our model achieves an accuracy of 91.68%.
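The record gives no implementation details for TEH, but the "differential analysis of actions" it describes is commonly realized as a frame-to-frame feature difference fused back into the per-frame features. A minimal sketch of that idea, assuming frame-level feature vectors of shape (T, C) and a residual-style fusion (the function name and fusion choice are illustrative assumptions, not the paper's actual design):

```python
import numpy as np

def temporal_difference_head(features):
    """Hypothetical temporal-difference embedding sketch.

    features: array of shape (T, C) -- one feature vector per frame.
    Computes frame-to-frame deltas and adds them back to the
    original features, so subtle temporal changes are emphasized.
    """
    T, C = features.shape
    diffs = np.diff(features, axis=0)          # (T-1, C) frame deltas
    diffs = np.vstack([np.zeros((1, C)), diffs])  # pad first frame with zeros
    return features + diffs                    # residual-style fusion

# Example: 4 frames with 3-dimensional features
x = np.arange(12, dtype=float).reshape(4, 3)
y = temporal_difference_head(x)
```

Static frames (zero delta) pass through unchanged, while frames that differ from their predecessor are amplified by the motion signal.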
Pages: 15
Related Papers
50 records
  • [31] Video spatio-temporal generative adversarial network for local action generation
    Liu, Xuejun
    Guo, Jiacheng
    Cui, Zhongji
    Liu, Ling
    Yan, Yong
    Sha, Yun
    JOURNAL OF ELECTRONIC IMAGING, 2023, 32 (05)
  • [32] A Spatio-Temporal Motion Network for Action Recognition Based on Spatial Attention
    Yang, Qi
    Lu, Tongwei
    Zhou, Huabing
    ENTROPY, 2022, 24 (03)
  • [33] A fast human action recognition network based on spatio-temporal features
    Xu, Jie
    Song, Rui
    Wei, Haoliang
    Guo, Jinhong
    Zhou, Yifei
    Huang, Xiwei
    NEUROCOMPUTING, 2021, 441 : 350 - 358
  • [35] MEST: An Action Recognition Network with Motion Encoder and Spatio-Temporal Module
    Zhang, Yi
    SENSORS, 2022, 22 (17)
  • [36] SPATIO-TEMPORAL SLOWFAST SELF-ATTENTION NETWORK FOR ACTION RECOGNITION
    Kim, Myeongjun
    Kim, Taehun
    Kim, Daijin
    2020 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2020, : 2206 - 2210
  • [37] IQ-STAN: IMAGE QUALITY GUIDED SPATIO-TEMPORAL ATTENTION NETWORK FOR LICENSE PLATE RECOGNITION
    Zhang, Cong
    Wang, Qi
    Li, Xuelong
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 2268 - 2272
  • [38] Action Recognition with Multiscale Spatio-Temporal Contexts
    Wang, Jiang
    Chen, Zhuoyuan
    Wu, Ying
    2011 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2011,
  • [39] Action recognition by spatio-temporal oriented energies
    Zhen, Xiantong
    Shao, Ling
    Li, Xuelong
    INFORMATION SCIENCES, 2014, 281 : 295 - 309
  • [40] Spatio-temporal information for human action recognition
    Yao, Li
    Liu, Yunjian
    Huang, Shihui
    EURASIP JOURNAL ON IMAGE AND VIDEO PROCESSING, 2016,