EPAM-Net: An efficient pose-driven attention-guided multimodal network for video action recognition

Cited by: 0

Authors:

Abdelkawy, Ahmed [1]

Ali, Asem [1]

Farag, Aly [1]

Affiliations:

[1] Univ Louisville, Comp Vis & Image Proc Lab CVIP, Louisville, KY 40292 USA

Source:

NEUROCOMPUTING | 2025 / Vol. 633

Keywords:

Human action recognition; Multimodal learning; X3D network; X-shiftNet; Spatial-temporal attention; Activities of daily living;

DOI:

10.1016/j.neucom.2025.129781

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Existing multimodal human action recognition approaches are computationally intensive, which limits their deployment in real-time applications. In this work, we present a novel and efficient pose-driven attention-guided multimodal network (EPAM-Net) for action recognition in videos. Specifically, we propose eXpand temporal Shift (X-ShiftNet) convolutional architectures for the RGB and pose streams to capture spatio-temporal features from RGB videos and their skeleton sequences. X-ShiftNet tackles the high computational cost of 3D CNNs by integrating the Temporal Shift Module (TSM) into an efficient 2D CNN, enabling efficient spatio-temporal learning. Skeleton features are then used to guide the visual network stream, focusing it on keyframes and their salient spatial regions through the proposed spatial-temporal attention block. Finally, the predictions of the two streams are fused for final classification. The experimental results show that our method, with a significant reduction in floating-point operations (FLOPs), outperforms or is competitive with state-of-the-art methods on the NTU RGB-D 60, NTU RGB-D 120, PKU-MMD, and Toyota SmartHome datasets. The proposed EPAM-Net provides up to a 72.8x reduction in FLOPs and up to a 48.6x reduction in the number of network parameters. The code will be available at https://github.com/ahmed-nady/Multimodal-ActionRecognition.
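
The temporal shift idea mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is a generic version of the shift operation, not the authors' X-ShiftNet: the function name `temporal_shift`, the `fold_div` parameter, and the (T, C, H, W) layout are assumptions made for illustration. A fraction of the channels is moved one frame backward in time, an equal fraction one frame forward, and the rest is left in place, so a following 2D convolution sees features from neighbouring frames without extra multiply-adds.

```python
import numpy as np


def temporal_shift(x, fold_div=8):
    """Shift a fraction of channels by one step along the time axis.

    x        : array of shape (T, C, H, W), one clip of T frames.
    fold_div : hypothetical parameter; C // fold_div channels are shifted
               in each temporal direction, the remainder is untouched.
    """
    t, c, h, w = x.shape
    fold = c // fold_div
    out = np.zeros_like(x)

    # First `fold` channels: frame i receives the features of frame i + 1.
    out[:-1, :fold] = x[1:, :fold]
    # Next `fold` channels: frame i receives the features of frame i - 1.
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]
    # Remaining channels are copied unchanged.
    out[:, 2 * fold:] = x[:, 2 * fold:]
    return out


if __name__ == "__main__":
    # Assumed toy input: 8 frames, 16 channels, 4x4 feature maps.
    clip = np.random.rand(8, 16, 4, 4)
    shifted = temporal_shift(clip)
    print(shifted.shape)  # (8, 16, 4, 4)
```

Frames at the clip boundary have no neighbour to borrow from, so the shifted channels there are zero-filled in this sketch.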

Citations

Pages: 10

12 items in total

[1] Pose-driven attention-guided image generation for person re-Identification
Khatun, Amena
Denman, Simon
Sridharan, Sridha
Fookes, Clinton
PATTERN RECOGNITION, 2023, 137
[2] A hybrid attention-guided ConvNeXt-GRU network for action recognition
An, Yiyuan
Yi, Yingmin
Han, Xiaoyong
Wu, Li
Su, Chunyi
Liu, Bojun
Xue, Xianghong
Li, Yankai
ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2024, 133
[3] SGM-Net: Skeleton-guided multimodal network for action recognition
Li, Jianan
Xie, Xuemei
Pan, Qingzhe
Cao, Yuhan
Zhao, Zhifu
Shi, Guangming
PATTERN RECOGNITION, 2020, 104 (104)
[4] LAE-Net: Light and Efficient Network for Compressed Video Action Recognition
Guo, Jinxin
Zhang, Jiaqiang
Zhang, Xiaojing
Ma, Ming
MULTIMEDIA MODELING, MMM 2023, PT II, 2023, 13834 : 265 - 276
[5] Attention-Guided and Topology-Enhanced Shift Graph Convolutional Network for Skeleton-Based Action Recognition
Lu, Chenghong
Chen, Hongbo
Li, Menglei
Jing, Lei
ELECTRONICS, 2024, 13 (18)
[6] AE-Net: Adjoint Enhancement Network for Efficient Action Recognition in Video Understanding
Wang, Bin
Liu, Chunsheng
Chang, Faliang
Wang, Wenqian
Li, Nanjun
IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 5458 - 5468
[7] AAEE-Net: Attention-guided aggregation and error-aware enhancement network for accurate and efficient stereo matching
Liu, Yujun
Zhang, Xiangchen
Su, Jinhe
Cai, Guorong
CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2023, 35 (22)
[8] Workout Action Recognition in Video Streams Using an Attention Driven Residual DC-GRU Network
Dey, Arnab
Biswas, Samit
Le, Dac-Nhuong
CMC-COMPUTERS MATERIALS & CONTINUA, 2024, 79 (02) : 3067 - 3087
[9] Towards efficient video-based action recognition: context-aware memory attention network
Koh, Thean Chun
Yeo, Chai Kiat
Jing, Xuan
Sivadas, Sunil
SN APPLIED SCIENCES, 2023, 5 (12)
