SSRT: A Sequential Skeleton RGB Transformer to Recognize Fine-Grained Human-Object Interactions and Action Recognition

Cited by: 7
Authors
Ghimire, Akash [1 ]
Kakani, Vijay [1 ]
Kim, Hakil [2 ]
Affiliations
[1] Inha Univ, Sch Global Convergence Studies, Dept Integrated Syst Engn, Incheon 402751, South Korea
[2] Inha Univ, Dept Informat & Commun Engn, Incheon 402751, South Korea
Keywords
Human factors; Transformers; Task analysis; Feature extraction; Human activity recognition; Solid modeling; Multimodality fusion; human action recognition; fine-grained actions; transformer cross-attention fusion
DOI
10.1109/ACCESS.2023.3278974
CLC number
TP [automation and computer technology]
Subject classification code
0812
Abstract
Combining skeleton and RGB modalities in human action recognition (HAR) has garnered attention because the two modalities complement each other. However, previous studies did not address the challenge of recognizing fine-grained human-object interactions (HOI). To tackle this problem, this study introduces a new transformer-based architecture called the Sequential Skeleton RGB Transformer (SSRT), which fuses the skeleton and RGB modalities. First, SSRT leverages Long Short-Term Memory (LSTM) and a multi-head attention mechanism to extract high-level features from both modalities. SSRT then employs a two-stage fusion method, transformer cross-attention fusion followed by softmax-layer late score fusion, to integrate the multimodal features effectively. Beyond fine-grained HOI recognition, this study also assesses the method's performance on two other action recognition tasks: general HAR and cross-dataset HAR. Furthermore, it compares HAR models that use single-modality features (RGB or skeleton) against the multimodal SSRT on all three tasks. To ensure a fair comparison, comparable state-of-the-art transformer architectures are employed for both the single-modality HAR models and SSRT. SSRT outperforms the best-performing single-modality HAR model on all three tasks, improving accuracy by 9.92% on fine-grained HOI recognition, 6.73% on general HAR, and 11.08% on cross-dataset HAR. The proposed fusion model also surpasses state-of-the-art multimodal fusion techniques such as Transformer Early Concatenation, improving accuracy by 6.32% on fine-grained HOI recognition, 4.04% on general HAR, and 6.56% on cross-dataset HAR.
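The abstract describes the architecture only at a high level. The fragment below is a minimal, hypothetical PyTorch sketch of the pipeline it names: an LSTM plus multi-head self-attention encoder per modality, a transformer cross-attention fusion stage, and softmax late score fusion. The class names (ModalityEncoder, SSRTSketch), all layer sizes, the 75-D skeleton and 512-D RGB feature dimensions, the use of skeleton features as cross-attention queries, and the equal-weight score averaging are assumptions for illustration, not details taken from the paper.

```python
# Illustrative sketch of the fusion pipeline the abstract describes;
# not the authors' implementation. All sizes and names are assumptions.
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """LSTM followed by multi-head self-attention over time (hypothetical sizes)."""
    def __init__(self, in_dim, hid_dim=256, heads=4):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hid_dim, batch_first=True)
        self.attn = nn.MultiheadAttention(hid_dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, T, in_dim)
        h, _ = self.lstm(x)                    # (B, T, hid_dim)
        out, _ = self.attn(h, h, h)            # self-attention across frames
        return out

class SSRTSketch(nn.Module):
    def __init__(self, skel_dim=75, rgb_dim=512, hid_dim=256, n_classes=60):
        super().__init__()
        self.skel_enc = ModalityEncoder(skel_dim, hid_dim)
        self.rgb_enc = ModalityEncoder(rgb_dim, hid_dim)
        # Stage 1: transformer cross-attention fusion
        # (skeleton stream as queries is an assumption).
        self.cross_attn = nn.MultiheadAttention(hid_dim, 4, batch_first=True)
        self.fused_head = nn.Linear(hid_dim, n_classes)
        self.skel_head = nn.Linear(hid_dim, n_classes)
        self.rgb_head = nn.Linear(hid_dim, n_classes)

    def forward(self, skel, rgb):
        s, r = self.skel_enc(skel), self.rgb_enc(rgb)
        fused, _ = self.cross_attn(s, r, r)    # skeleton queries attend to RGB
        # Temporal average pooling before classification (an assumption).
        fused, s, r = fused.mean(1), s.mean(1), r.mean(1)
        # Stage 2: late score fusion of the softmax outputs.
        scores = (self.fused_head(fused).softmax(-1)
                  + self.skel_head(s).softmax(-1)
                  + self.rgb_head(r).softmax(-1)) / 3
        return scores

# Usage: a batch of 2 clips, 32 frames each, with 75-D skeleton
# (e.g. 25 joints x 3 coordinates) and 512-D RGB features per frame.
model = SSRTSketch()
probs = model(torch.randn(2, 32, 75), torch.randn(2, 32, 512))
print(probs.shape)  # torch.Size([2, 60])
```

Equal averaging of the three softmax score streams is only one plausible late-fusion choice; the paper's actual weighting and head structure may differ.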
Pages: 51930-51948
Page count: 19