STAN: Spatio-Temporal Analysis Network for efficient video action recognition

Cited by: 0
Authors
Chen, Shilin [1 ,2 ]
Wang, Xingwang [1 ,3 ,4 ]
Sun, Yafeng [1 ]
Yan, Kun [5 ,6 ,7 ,8 ,9 ]
Affiliations
[1] Jilin Univ, Coll Comp Sci & Technol, Changchun 130012, Peoples R China
[2] Jilin Univ, Coll Software Engn, Changchun, Peoples R China
[3] Jilin Univ, Key Lab Symbol Computat Knowledge Engn, Minist Educ, Changchun 130012, Peoples R China
[4] Jilin Univ, Engn, Changchun, Peoples R China
[5] Univ Essex, Sch Comp Sci & Elect Engn, Colchester CO4 3SQ, England
[6] Acad Europaea, MAE, Colchester, England
[7] IET Inst Engn & Technol, London, England
[8] BCS British Comp Soc, London, England
[9] ACM, London, England
Keywords
Video understanding; Action recognition; Attention mechanism; Model optimization; Network construction
DOI
10.1016/j.eswa.2024.126255
CLC number
TP18 [Theory of artificial intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Action recognition, the task of identifying and extracting spatio-temporal features from video content, is a foundation of video understanding. However, current methods for sampling these features are computationally intensive and require complex model architectures. In this paper, we improve 2D CNNs for action recognition, keeping the model streamlined while making effective and complementary use of the spatio-temporal features of videos. We propose the Spatio-Temporal Analysis Network (STAN), which strikes a balance between model complexity and recognition accuracy. It contains two key components acting on spatio-temporal features: the Temporal Embedding Head (TEH) and Spatio-Temporal Attention (STA). TEH introduces a differential analysis of actions, allowing the model to capture subtle temporal changes and enhancing its representational capacity. STA offers a novel perspective on video streams, improving the spatio-temporal representation without significantly increasing computational demands; it achieves this through a stylized spatial analysis of the features that differs from conventional optical-flow and depth-map methods. Results on four datasets demonstrate our method's efficiency and accuracy. Compared to 3D CNNs, our method improves action recognition accuracy by 1.2% and reduces computational cost by 30%. Using only 20% of the UCF101 dataset, our model achieves an accuracy of 91.68%.
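The abstract describes TEH as a differential analysis of actions that captures subtle temporal changes, but the paper's exact formulation is not reproduced in this record. As a rough illustration of the general idea only — frame-wise temporal differencing fused with the original appearance features — a minimal NumPy sketch might look like the following (the function name, shapes, and fusion-by-concatenation choice are all assumptions, not the authors' implementation):

```python
import numpy as np

def temporal_embedding_head(frames: np.ndarray) -> np.ndarray:
    """Hypothetical sketch of a temporal-difference embedding.

    frames: array of shape (T, C, H, W) -- T video frames.
    Returns features of shape (T, 2*C, H, W): each frame is
    concatenated with its forward temporal difference.
    """
    # Forward difference frames[t+1] - frames[t]; zero-pad the last step
    diff = np.diff(frames, axis=0)                               # (T-1, C, H, W)
    diff = np.concatenate([diff, np.zeros_like(frames[:1])], axis=0)
    # Fuse appearance (raw frame) and motion cue (difference) channels
    return np.concatenate([frames, diff], axis=1)

# Example: 8 RGB frames of size 32x32 -> features of shape (8, 6, 32, 32)
feats = temporal_embedding_head(np.random.rand(8, 3, 32, 32))
print(feats.shape)
```

A real TEH would presumably operate on learned feature maps inside a 2D CNN rather than raw pixels; the sketch only shows how a differential signal can be extracted without 3D convolutions or optical flow.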
Pages: 15
Related papers
50 records in total
  • [1] Efficient spatio-temporal network for action recognition
    Su, Yanxiong
    Zhao, Qian
    JOURNAL OF REAL-TIME IMAGE PROCESSING, 2024, 21 (05)
  • [2] Spatio-temporal Video Autoencoder for Human Action Recognition
    Sousa e Santos, Anderson Carlos
    Pedrini, Helio
    PROCEEDINGS OF THE 14TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER VISION, IMAGING AND COMPUTER GRAPHICS THEORY AND APPLICATIONS (VISAPP), VOL 5, 2019: 114 - 123
  • [3] Exploiting spatio-temporal knowledge for video action recognition
    Zhang, Huigang
    Wang, Liuan
    Sun, Jun
    IET COMPUTER VISION, 2023, 17 (02): 222 - 230
  • [4] Interpretable Spatio-temporal Attention for Video Action Recognition
    Meng, Lili
    Zhao, Bo
    Chang, Bo
    Huang, Gao
    Sun, Wei
    Tung, Frederich
    Sigal, Leonid
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2019: 1513 - 1522
  • [5] STAN: Spatio-Temporal Alignment Network for No-Reference Video Quality Assessment
    Yang, Zhengyi
    Dang, Yuanjie
    Xiang, Jianjun
    Chen, Peng
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT III, 2023, 14256: 160 - 171
  • [6] Spatio-Temporal Collaborative Module for Efficient Action Recognition
    Hao, Yanbin
    Wang, Shuo
    Tan, Yi
    He, Xiangnan
    Liu, Zhenguang
    Wang, Meng
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31: 7279 - 7291
  • [7] STHARNet: spatio-temporal human action recognition network in content based video retrieval
    Sowmyayani, S.
    Rani, P. Arockia Jansi
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (24): 38051 - 38066
  • [8] SiamMAST: Siamese motion-aware spatio-temporal network for video action recognition
    Lu, Xuemin
    Quan, Wei
    Reformat, Marek
    Zhao, Haiquan
    Chen, Jim X.
    VISUAL COMPUTER, 2024, 40 (05): 3163 - 3181