Skeleton-weighted and multi-scale temporal-driven network for video action recognition

Authors
Xu, Ziqi [1]
Zhang, Jie [2, 3]
Zhang, Peng [2, 3]
Ding, Pengfei [4]
Affiliations
[1] Donghua Univ, Coll Comp Sci & Technol, Shanghai, Peoples R China
[2] Minist Educ, Engn Res Ctr Digitalized Textile & Fash Technol, Shanghai, Peoples R China
[3] Donghua Univ, Shanghai Engn Res Ctr Ind Big Data & Intelligent, Inst Artificial Intelligence, Shanghai, Peoples R China
[4] Donghua Univ, Coll Mech Engn, Shanghai, Peoples R China
Keywords
video action recognition; multi-model; feature extraction; temporal modeling; feature fusion; RGB;
DOI
10.1117/1.JEI.33.6.063056
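Chinese Library Classification (CLC) number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]
Discipline classification code
0808; 0809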
Abstract
Sequential and causal relationships among actions are critical for accurate video interpretation. Therefore, capturing both short-term and long-term temporal information is essential for effective action recognition. Current research, however, primarily focuses on fusing spatial features from diverse modalities for short-term action recognition, inadequately modeling the complex temporal dependencies in videos, leading to suboptimal performance. To address this limitation, we propose a skeleton-weighted and multi-scale temporal-driven action recognition network that integrates RGB and skeleton modalities to effectively capture both short-term and long-term temporal information. First, we propose a temporal-enhanced adaptive graph convolutional network. This network derives motion attention masks from the skeletal joints and transfers them to RGB videos to generate visually salient regions, thereby achieving a concise and effective input representation. Subsequently, we develop a multi-scale local-global temporal modeling network driven by a self-attention mechanism, which effectively captures fine-grained local details of individual actions along with global temporal relationships among actions across multiple temporal resolutions. Moreover, we design a multi-level adaptive temporal scale mixer module that efficiently integrates multi-scale features, creating a unified temporal feature representation to ensure temporal consistency. Finally, we conducted extensive experiments on the NTU-RGBD-60, NTU-RGBD-120, NW-UCLA, and Kinetics datasets to validate the effectiveness of the proposed method. (c) 2024 SPIE and IS&T
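The skeleton-to-RGB transfer described in the abstract can be illustrated with a much simpler stand-in. The sketch below is not the paper's temporal-enhanced adaptive graph convolutional network: it assumes hand-crafted joint weights (each joint's total displacement across frames) in place of learned attention, and a sum of Gaussians centred on the joints as the mask applied to an RGB frame. The function names, the `sigma` parameter, and the 2D pixel-coordinate input format are all hypothetical.

```python
import math


def joint_motion_weights(skeleton):
    """Weight each joint by its total displacement across the clip.

    skeleton: list of frames; each frame is a list of (x, y) joint positions.
    Returns one weight per joint; the weights sum to 1.
    Assumption: displacement is a crude proxy for learned motion attention.
    """
    num_joints = len(skeleton[0])
    motion = [0.0] * num_joints
    for prev, curr in zip(skeleton, skeleton[1:]):
        for j in range(num_joints):
            motion[j] += math.dist(prev[j], curr[j])
    total = sum(motion)
    if total == 0.0:
        # No joint moved: fall back to uniform weights.
        return [1.0 / num_joints] * num_joints
    return [m / total for m in motion]


def attention_mask(joints, weights, height, width, sigma=2.0):
    """Build a height x width mask for one frame.

    The mask is a weighted sum of Gaussians centred on the joints,
    rescaled so that its maximum value is 1.
    """
    mask = [[0.0] * width for _ in range(height)]
    for (jx, jy), w in zip(joints, weights):
        for y in range(height):
            for x in range(width):
                d2 = (x - jx) ** 2 + (y - jy) ** 2
                mask[y][x] += w * math.exp(-d2 / (2.0 * sigma ** 2))
    peak = max(max(row) for row in mask)
    if peak > 0.0:
        mask = [[v / peak for v in row] for row in mask]
    return mask


if __name__ == "__main__":
    # Two joints over three frames: joint 0 is static, joint 1 moves right.
    clip = [[(1, 1), (5, 5)], [(1, 1), (6, 5)], [(1, 1), (7, 5)]]
    weights = joint_motion_weights(clip)
    mask = attention_mask(clip[-1], weights, height=8, width=10)
    print(weights)
    print(round(mask[5][7], 3), round(mask[1][1], 3))
```

Multiplying each RGB pixel by the corresponding mask value would suppress regions far from moving joints, which is the general idea of a visually salient input; the weighting and mask shape used in the paper may differ.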
Pages: 23