STAN: Spatio-Temporal Analysis Network for efficient video action recognition

Cited: 0
Authors
Chen, Shilin [1 ,2 ]
Wang, Xingwang [1 ,3 ,4 ]
Sun, Yafeng [1 ]
Yan, Kun [5 ,6 ,7 ,8 ,9 ]
Affiliations
[1] Jilin Univ, Coll Comp Sci & Technol, Changchun 130012, Peoples R China
[2] Jilin Univ, Coll Software Engn, Changchun, Peoples R China
[3] Jilin Univ, Key Lab Symbol Computat Knowledge Engn, Minist Educ, Changchun 130012, Peoples R China
[4] Jilin Univ, Engn, Changchun, Peoples R China
[5] Univ Essex, Sch Comp Sci & Elect Engn, Colchester CO4 3SQ, England
[6] Acad Europaea, MAE, Colchester, England
[7] IET Inst Engn & Technol, London, England
[8] BCS British Comp Soc, London, England
[9] ACM, London, England
Keywords
Video understanding; Action recognition; Attention mechanism; Model optimization; Network construction;
DOI
10.1016/j.eswa.2024.126255
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Action recognition, whose goal is identifying and extracting spatio-temporal features from video content, is foundational to video understanding. However, current methods for sampling these features are computationally intensive and require complex model architectures. In this paper, we improve 2D CNNs for action recognition tasks, keeping the model streamlined while making effective and complementary use of the spatio-temporal features of videos. We propose a model, the Spatio-Temporal Analysis Network (STAN), which strikes a balance between model complexity and recognition accuracy. It contains two key components that operate on spatio-temporal features: the Temporal Embedding Head (TEH) and Spatio-Temporal Attention (STA). TEH introduces a differential analysis of actions, allowing the model to capture subtle temporal changes and enhancing its representational capabilities. STA offers a novel perspective on video streams, improving the spatio-temporal representation without significantly increasing computational demands. It achieves this through a stylized spatial analysis of the features that differs from conventional optical-flow and depth-map methods. Results on four datasets demonstrate the remarkable efficiency and accuracy of our methodology. Compared to 3D CNNs, our method improves action recognition accuracy by 1.2% and reduces computational costs by 30%. With a dataset utilization rate of only 20% of UCF101, our model achieves an accuracy of 91.68%.
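The record gives no implementation details for TEH, but the "differential analysis of actions" it describes is commonly realized as a frame-to-frame feature difference fused back into the per-frame features. A minimal sketch of that idea, assuming frame-level feature vectors of shape (T, C) and a residual-style fusion (the function name and fusion choice are illustrative assumptions, not the paper's actual design):

```python
import numpy as np

def temporal_difference_head(features):
    """Hypothetical temporal-difference embedding sketch.

    features: array of shape (T, C) -- one feature vector per frame.
    Computes frame-to-frame deltas and adds them back to the
    original features, so subtle temporal changes are emphasized.
    """
    T, C = features.shape
    diffs = np.diff(features, axis=0)          # (T-1, C) frame deltas
    diffs = np.vstack([np.zeros((1, C)), diffs])  # pad first frame with zeros
    return features + diffs                    # residual-style fusion

# Example: 4 frames with 3-dimensional features
x = np.arange(12, dtype=float).reshape(4, 3)
y = temporal_difference_head(x)
```

Static frames (zero delta) pass through unchanged, while frames that differ from their predecessor are amplified by the motion signal.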
Pages: 15
Related Papers
50 records
  • [31] Video spatio-temporal generative adversarial network for local action generation
    Liu, Xuejun
    Guo, Jiacheng
    Cui, Zhongji
    Liu, Ling
    Yan, Yong
    Sha, Yun
    JOURNAL OF ELECTRONIC IMAGING, 2023, 32 (05)
  • [32] A Spatio-Temporal Motion Network for Action Recognition Based on Spatial Attention
    Yang, Qi
    Lu, Tongwei
    Zhou, Huabing
    ENTROPY, 2022, 24 (03)
  • [33] A fast human action recognition network based on spatio-temporal features
    Xu, Jie
    Song, Rui
    Wei, Haoliang
    Guo, Jinhong
    Zhou, Yifei
    Huang, Xiwei
    NEUROCOMPUTING, 2021, 441 : 350 - 358
  • [35] MEST: An Action Recognition Network with Motion Encoder and Spatio-Temporal Module
    Zhang, Yi
    SENSORS, 2022, 22 (17)
  • [36] SPATIO-TEMPORAL SLOWFAST SELF-ATTENTION NETWORK FOR ACTION RECOGNITION
    Kim, Myeongjun
    Kim, Taehun
    Kim, Daijin
    2020 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2020, : 2206 - 2210
  • [37] IQ-STAN: IMAGE QUALITY GUIDED SPATIO-TEMPORAL ATTENTION NETWORK FOR LICENSE PLATE RECOGNITION
    Zhang, Cong
    Wang, Qi
    Li, Xuelong
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 2268 - 2272
  • [38] Action Recognition with Multiscale Spatio-Temporal Contexts
    Wang, Jiang
    Chen, Zhuoyuan
    Wu, Ying
    2011 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2011,
  • [39] Action recognition by spatio-temporal oriented energies
    Zhen, Xiantong
    Shao, Ling
    Li, Xuelong
    INFORMATION SCIENCES, 2014, 281 : 295 - 309
  • [40] Spatio-temporal information for human action recognition
    Yao, Li
    Liu, Yunjian
    Huang, Shihui
    EURASIP JOURNAL ON IMAGE AND VIDEO PROCESSING, 2016,