SiamMAST: Siamese motion-aware spatio-temporal network for video action recognition

Cited by: 1
Authors
Lu, Xuemin [1 ,2 ]
Quan, Wei [2 ]
Marek, Reformat [3 ]
Zhao, Haiquan [2 ]
Chen, Jim X. X. [4 ]
Affiliations
[1] Southwest China Inst Elect Technol, Chengdu 610036, Peoples R China
[2] Southwest Jiaotong Univ, Sch Elect Engn, Chengdu 610031, Sichuan, Peoples R China
[3] Univ Alberta, Sch Elect & Comp Engn, Edmonton, AB T6G 1H9, Canada
[4] George Mason Univ, Dept Comp Sci, Fairfax, VA 22030 USA
Source
VISUAL COMPUTER | 2024 / Vol. 40 / Issue 05
Funding
National Natural Science Foundation of China;
Keywords
Video action recognition; Siamese network; Spatio-temporal features; Spatial-motion awareness; Temporal-motion awareness; VECTOR;
DOI
10.1007/s00371-023-03018-2
Chinese Library Classification (CLC) number
TP31 [Computer software];
Discipline classification code
081202 ; 0835 ;
Abstract
This paper proposes a Siamese motion-aware spatio-temporal network (SiamMAST) for video action recognition. SiamMAST fuses four kinds of features extracted from video frames: spatial features, temporal features, spatial dynamic features, and temporal dynamic features of a moving target. The network comprises AlexNets as the backbone, LSTMs, and spatial motion-awareness and temporal motion-awareness sub-modules. RGB images are fed into the network, where the AlexNets extract spatial features; these are then fed into the LSTMs to generate temporal features. In addition, the spatial motion-awareness and temporal motion-awareness sub-modules capture spatial and temporal dynamic features. Finally, all features are fused and fed into the classification layer. The final recognition result is produced by averaging the test label probabilities across a fixed number of RGB frames and selecting the label with the highest probability. The whole network is trained offline, end to end, on large-scale image datasets using standard SGD with back-propagation. The proposed network is evaluated on two challenging datasets, UCF101 (93.53%) and HMDB51 (69.36%). The experiments demonstrate the effectiveness and efficiency of the proposed SiamMAST.
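The final decision step described in the abstract (averaging label probabilities over a fixed number of RGB frames, then picking the most probable label) can be illustrated with a minimal sketch. This is not the authors' code: the function name `predict_video_label`, the three example labels, and the probability values are all hypothetical, and the per-frame probabilities are assumed to come from some already-trained classifier.

```python
from typing import Sequence


def predict_video_label(
    frame_probs: Sequence[Sequence[float]],
    labels: Sequence[str],
) -> str:
    """Average per-frame class probabilities and return the top label.

    frame_probs holds one probability vector per sampled RGB frame
    (hypothetical classifier output); labels names each class index.
    """
    if not frame_probs:
        raise ValueError("at least one frame is required")
    num_classes = len(labels)
    if any(len(p) != num_classes for p in frame_probs):
        raise ValueError("each frame needs one probability per label")

    # Mean probability of each class across all sampled frames.
    num_frames = len(frame_probs)
    mean_probs = [
        sum(p[c] for p in frame_probs) / num_frames
        for c in range(num_classes)
    ]

    # Index of the class with the highest averaged probability.
    best = max(range(num_classes), key=mean_probs.__getitem__)
    return labels[best]


# Illustrative values only: three frames, three made-up classes.
example_labels = ["archery", "biking", "diving"]
example_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.7, 0.2],
]
print(predict_video_label(example_probs, example_labels))  # prints "biking"
```

In this toy input the first frame alone would vote for "archery", but the averaged probabilities favor "biking", which is the kind of per-frame noise that averaging is meant to smooth out.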
Pages: 3163-3181
Page count: 19