A multidimensional feature fusion network based on MGSE and TAAC for video-based human action recognition

Cited by: 5
Authors
Zhou, Shuang [1 ]
Xu, Hongji [1 ]
Bai, Zhiquan [1 ]
Du, Zhengfeng [1 ]
Zeng, Jiaqi [1 ]
Wang, Yang [1 ]
Wang, Yuhao [1 ]
Li, Shijie [1 ]
Wang, Mengmeng [1 ]
Li, Yiran [1 ]
Li, Jianjun [1 ]
Xu, Jie [1 ]
Affiliations
[1] Shandong Univ, Sch Informat Sci & Engn, 72 Binhai Rd, Qingdao 266237, Shandong, Peoples R China
Keywords
Human action recognition; Multidimensional feature; Multiscale convolution;
DOI
10.1016/j.neunet.2023.09.031
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
With the maturation of intelligent technologies such as human-computer interaction, human action recognition (HAR) has been widely applied in virtual reality, video surveillance, and other fields. However, current video-based HAR methods still cannot fully extract abstract action features, and action collection and recognition remain lacking for special populations such as prisoners and elderly people living alone. To address these problems, this paper proposes a multidimensional feature fusion network, called P-MTSC3D, a parallel network based on context modeling and a temporal adaptive attention module. It consists of three branches. The first branch is the basic network branch, which extracts basic feature information. The second branch consists of a feature pre-extraction layer and two multiscale-convolution-based global context modeling combined squeeze and excitation (MGSE) modules, which extract spatial and channel features. The third branch consists of two temporal adaptive attention units based on convolution (TAAC), which extract temporal features. To verify the effectiveness of the proposed network, experiments are conducted on the University of Central Florida (UCF) 101 dataset and the human motion database (HMDB) 51 dataset. The recognition accuracy of the proposed P-MTSC3D network is 97.92% on UCF101 and 75.59% on HMDB51. The network requires 30.85G FLOPs, and its test time is 2.83 s per 16 samples on UCF101. The experimental results demonstrate that P-MTSC3D achieves better overall performance than state-of-the-art networks. In addition, a prison action (PA) dataset is constructed to verify the network's applicability in real-world scenarios.
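The abstract outlines the three-branch design but gives no implementation detail. Below is a minimal PyTorch sketch of how such a parallel fusion could look. All module internals here (kernel sizes, channel widths, pooling choices, and the concatenation-based fusion rule) are illustrative assumptions, not the authors' P-MTSC3D implementation; only the branch roles (basic features, MGSE spatial/channel attention, TAAC temporal attention) come from the abstract.

```python
# Hedged sketch of a three-branch fusion network in the spirit of P-MTSC3D.
# Everything below the branch-level structure is an assumption for illustration.
import torch
import torch.nn as nn

class MGSE(nn.Module):
    """Assumed MGSE block: multiscale 3D convolutions whose fused output
    is reweighted by a squeeze-and-excitation channel gate."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        # Multiscale spatial context (kernel sizes are assumptions).
        self.branch1 = nn.Conv3d(channels, channels, kernel_size=1)
        self.branch3 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        # Squeeze-and-excitation over the channel dimension.
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                      # x: (N, C, T, H, W)
        y = self.branch1(x) + self.branch3(x)  # fuse multiscale context
        w = self.fc(self.pool(y).flatten(1))   # per-channel attention weights
        return y * w.view(*w.shape, 1, 1, 1)   # reweight channels

class TAAC(nn.Module):
    """Assumed TAAC unit: adaptive temporal attention produced by a 1D
    convolution over the time axis of spatially pooled features."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x):                      # x: (N, C, T, H, W)
        t = x.mean(dim=(3, 4))                 # (N, C, T) temporal descriptor
        a = torch.sigmoid(self.conv(t))        # adaptive per-frame weights
        return x * a.unsqueeze(-1).unsqueeze(-1)

class PMTSC3D(nn.Module):
    """Toy three-branch model; the real P-MTSC3D uses a deeper backbone."""
    def __init__(self, in_ch=3, width=16, num_classes=101):
        super().__init__()
        self.basic = nn.Conv3d(in_ch, width, 3, padding=1)   # branch 1: basic features
        self.pre = nn.Conv3d(in_ch, width, 3, padding=1)     # branch 2: pre-extraction
        self.mgse = nn.Sequential(MGSE(width), MGSE(width))  # two MGSE modules
        self.taac = nn.Sequential(                           # branch 3: two TAAC units
            nn.Conv3d(in_ch, width, 3, padding=1), TAAC(width), TAAC(width))
        self.head = nn.Linear(3 * width, num_classes)

    def forward(self, x):                      # x: (N, 3, T, H, W)
        feats = [self.basic(x), self.mgse(self.pre(x)), self.taac(x)]
        pooled = [f.mean(dim=(2, 3, 4)) for f in feats]      # global pooling per branch
        return self.head(torch.cat(pooled, dim=1))           # concatenation fusion

logits = PMTSC3D()(torch.randn(2, 3, 16, 56, 56))            # -> (2, 101)
print(logits.shape)
```

Concatenating globally pooled branch outputs is only one plausible fusion rule; the paper's actual combination of the three feature streams may differ.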
Pages: 496-507
Number of pages: 12
Related papers
50 records in total
  • [1] FENet: An Efficient Feature Excitation Network for Video-based Human Action Recognition
    Zhang, Zhan
    Jin, Yi
    Feng, Songhe
    Li, Yidong
    Wang, Tao
    Tian, Hui
2022 16TH IEEE INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING (ICSP2022), VOL 1, 2022: 540-544
  • [2] Diverse Features Fusion Network for video-based action recognition
    Deng, Haoyang
    Kong, Jun
    Jiang, Min
    Liu, Tianshan
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2021, 77
  • [3] CANet: Comprehensive Attention Network for video-based action recognition
    Gao, Xiong
    Chang, Zhaobin
    Ran, Xingcheng
    Lu, Yonggang
    KNOWLEDGE-BASED SYSTEMS, 2024, 296
  • [4] Research on Video-Based Human Action Behavior Recognition Algorithms
    Si, Haifei
    Hu, Xingliu
    Wang, Yizhi
    2019 5TH INTERNATIONAL CONFERENCE ON ENVIRONMENTAL SCIENCE AND MATERIAL APPLICATION, 2020, 440
  • [5] A survey of video-based human action recognition in team sports
    Yin, Hongwei
    Sinnott, Richard O.
    Jayaputera, Glenn T.
    ARTIFICIAL INTELLIGENCE REVIEW, 2024, 57 (11)
  • [6] Human action recognition based on multiple feature fusion
AMSE Press, France (60)
  • [7] Multi-Attention Fusion Network for Video-based Emotion Recognition
    Wang, Yanan
    Wu, Jianming
    Hoashi, Keiichiro
ICMI'19: PROCEEDINGS OF THE 2019 INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, 2019: 595-601
  • [8] Feature fusion of side face and gait for video-based human identification
    Zhou, Xiaoli
    Bhanu, Bir
PATTERN RECOGNITION, 2008, 41(03): 778-795
  • [9] Action Recognition of Temporal Segment Network Based on Feature Fusion
    Li H.
    Ding Y.
    Li C.
    Zhang S.
Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2020, 57(01): 145-158
  • [10] Sensor Substitution for Video-based Action Recognition
    Rupprecht, Christian
    Lea, Colin
    Tombari, Federico
    Navab, Nassir
    Hager, Gregory D.
2016 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS 2016), 2016: 5230-5237