SSRT: A Sequential Skeleton RGB Transformer to Recognize Fine-Grained Human-Object Interactions and Action Recognition

Cited by: 7
Authors
Ghimire, Akash [1 ]
Kakani, Vijay [1 ]
Kim, Hakil [2 ]
Affiliations
[1] Inha Univ, Sch Global Convergence Studies, Dept Integrated Syst Engn, Incheon 402751, South Korea
[2] Inha Univ, Dept Informat & Commun Engn, Incheon 402751, South Korea
Keywords
Human factors; Transformers; Task analysis; Feature extraction; Human activity recognition; Solid modeling; Multimodality fusion; human action recognition; fine-grained actions; transformer cross-attention fusion
DOI
10.1109/ACCESS.2023.3278974
CLC number
TP [automation and computer technology]
Subject classification code
0812
Abstract
Combining skeleton and RGB modalities in human action recognition (HAR) has garnered attention because the two modalities complement each other. However, previous studies did not address the challenge of recognizing fine-grained human-object interactions (HOI). To tackle this problem, this study introduces a new transformer-based architecture called the Sequential Skeleton RGB Transformer (SSRT), which fuses the skeleton and RGB modalities. First, SSRT leverages Long Short-Term Memory (LSTM) and a multi-head attention mechanism to extract high-level features from both modalities. SSRT then employs a two-stage fusion method, transformer cross-attention fusion followed by softmax-layer late score fusion, to integrate the multimodal features effectively. Beyond fine-grained HOI recognition, this study also assesses the method's performance on two other action recognition tasks: general HAR and cross-dataset HAR. Furthermore, it compares HAR models that use single-modality features (RGB or skeleton) against the multimodal SSRT on all three tasks. To ensure a fair comparison, comparable state-of-the-art transformer architectures are employed for both the single-modality HAR models and SSRT. SSRT outperforms the best-performing single-modality HAR model on all three tasks, improving accuracy by 9.92% on fine-grained HOI recognition, 6.73% on general HAR, and 11.08% on cross-dataset HAR. The proposed fusion model also surpasses state-of-the-art multimodal fusion techniques such as Transformer Early Concatenation, improving accuracy by 6.32% on fine-grained HOI recognition, 4.04% on general HAR, and 6.56% on cross-dataset HAR.
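The abstract describes the architecture only at a high level. The fragment below is a minimal, hypothetical PyTorch sketch of the pipeline it names: an LSTM plus multi-head self-attention encoder per modality, a transformer cross-attention fusion stage, and softmax late score fusion. The class names (ModalityEncoder, SSRTSketch), all layer sizes, the 75-D skeleton and 512-D RGB feature dimensions, the use of skeleton features as cross-attention queries, and the equal-weight score averaging are assumptions for illustration, not details taken from the paper.

```python
# Illustrative sketch of the fusion pipeline the abstract describes;
# not the authors' implementation. All sizes and names are assumptions.
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """LSTM followed by multi-head self-attention over time (hypothetical sizes)."""
    def __init__(self, in_dim, hid_dim=256, heads=4):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hid_dim, batch_first=True)
        self.attn = nn.MultiheadAttention(hid_dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, T, in_dim)
        h, _ = self.lstm(x)                    # (B, T, hid_dim)
        out, _ = self.attn(h, h, h)            # self-attention across frames
        return out

class SSRTSketch(nn.Module):
    def __init__(self, skel_dim=75, rgb_dim=512, hid_dim=256, n_classes=60):
        super().__init__()
        self.skel_enc = ModalityEncoder(skel_dim, hid_dim)
        self.rgb_enc = ModalityEncoder(rgb_dim, hid_dim)
        # Stage 1: transformer cross-attention fusion
        # (skeleton stream as queries is an assumption).
        self.cross_attn = nn.MultiheadAttention(hid_dim, 4, batch_first=True)
        self.fused_head = nn.Linear(hid_dim, n_classes)
        self.skel_head = nn.Linear(hid_dim, n_classes)
        self.rgb_head = nn.Linear(hid_dim, n_classes)

    def forward(self, skel, rgb):
        s, r = self.skel_enc(skel), self.rgb_enc(rgb)
        fused, _ = self.cross_attn(s, r, r)    # skeleton queries attend to RGB
        # Temporal average pooling before classification (an assumption).
        fused, s, r = fused.mean(1), s.mean(1), r.mean(1)
        # Stage 2: late score fusion of the softmax outputs.
        scores = (self.fused_head(fused).softmax(-1)
                  + self.skel_head(s).softmax(-1)
                  + self.rgb_head(r).softmax(-1)) / 3
        return scores

# Usage: a batch of 2 clips, 32 frames each, with 75-D skeleton
# (e.g. 25 joints x 3 coordinates) and 512-D RGB features per frame.
model = SSRTSketch()
probs = model(torch.randn(2, 32, 75), torch.randn(2, 32, 512))
print(probs.shape)  # torch.Size([2, 60])
```

Equal averaging of the three softmax score streams is only one plausible late-fusion choice; the paper's actual weighting and head structure may differ.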
Pages: 51930-51948
Page count: 19