SSRT: A Sequential Skeleton RGB Transformer to Recognize Fine-Grained Human-Object Interactions and Action Recognition

Cited by: 7
Authors
Ghimire, Akash [1 ]
Kakani, Vijay [1 ]
Kim, Hakil [2 ]
Affiliations
[1] Inha Univ, Sch Global Convergence Studies, Dept Integrated Syst Engn, Incheon 402751, South Korea
[2] Inha Univ, Dept Informat & Commun Engn, Incheon 402751, South Korea
Keywords
Human factors; Transformers; Task analysis; Feature extraction; Human activity recognition; Solid modeling; Multimodality fusion; human action recognition; fine-grained actions; transformer cross-attention fusion;
DOI
10.1109/ACCESS.2023.3278974
Chinese Library Classification
TP [Automation and computer technology]
Discipline Classification Code
0812
Abstract
Combining skeleton and RGB modalities in human action recognition (HAR) has garnered attention because the two modalities complement each other. However, previous studies did not address the challenge of recognizing fine-grained human-object interactions (HOI). To tackle this problem, this study introduces a new transformer-based architecture, the Sequential Skeleton RGB Transformer (SSRT), which fuses the skeleton and RGB modalities. First, SSRT leverages Long Short-Term Memory (LSTM) and a multi-head attention mechanism to extract high-level features from both modalities. It then applies a two-stage fusion method, consisting of transformer cross-attention fusion and softmax-layer late score fusion, to integrate the multimodal features effectively. Besides fine-grained HOI recognition, this study also evaluates the proposed method on two other action recognition tasks: general HAR and cross-dataset HAR. Furthermore, it compares HAR models using single-modality features (RGB or skeleton) with the proposed multimodality model on all three tasks. To ensure a fair comparison, comparable state-of-the-art transformer architectures are employed for both the single-modality HAR models and SSRT. SSRT outperforms the best-performing single-modality HAR model on all three tasks, improving accuracy by 9.92% on fine-grained HOI recognition, 6.73% on general HAR, and 11.08% on cross-dataset HAR. Additionally, the proposed fusion model surpasses state-of-the-art multimodal fusion techniques such as Transformer Early Concatenation, improving accuracy by 6.32% on fine-grained HOI recognition, 4.04% on general HAR, and 6.56% on cross-dataset HAR.
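The abstract outlines SSRT's pipeline: per-modality LSTM and multi-head attention feature extraction, followed by a two-stage fusion of transformer cross-attention and softmax-layer late score fusion. The PyTorch sketch below only illustrates that general structure; it is not the authors' implementation, and the layer sizes, class count, mean pooling, and equal-weight score averaging are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): cross-attention fusion of skeleton and
# RGB feature sequences, followed by late fusion of per-branch softmax scores.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Fuses two modality feature sequences with multi-head cross-attention."""

    def __init__(self, dim=256, heads=8, num_classes=60):
        super().__init__()
        # High-level temporal encoders (LSTM + self-attention) per modality.
        self.skel_lstm = nn.LSTM(dim, dim, batch_first=True)
        self.rgb_lstm = nn.LSTM(dim, dim, batch_first=True)
        self.skel_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.rgb_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Stage 1: cross-attention, each modality queries the other.
        self.cross_s2r = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_r2s = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Per-branch classifiers whose softmax scores are fused late (stage 2).
        self.head_fused = nn.Linear(2 * dim, num_classes)
        self.head_skel = nn.Linear(dim, num_classes)
        self.head_rgb = nn.Linear(dim, num_classes)

    def forward(self, skel, rgb):
        # skel, rgb: (batch, time, dim) pre-extracted modality features.
        s, _ = self.skel_lstm(skel)
        r, _ = self.rgb_lstm(rgb)
        s, _ = self.skel_attn(s, s, s)
        r, _ = self.rgb_attn(r, r, r)
        # Stage 1: transformer cross-attention fusion.
        s2r, _ = self.cross_s2r(s, r, r)   # skeleton queries attend to RGB
        r2s, _ = self.cross_r2s(r, s, s)   # RGB queries attend to skeleton
        fused = torch.cat([s2r.mean(dim=1), r2s.mean(dim=1)], dim=-1)
        # Stage 2: softmax late score fusion (equal weights assumed here).
        scores = (
            torch.softmax(self.head_fused(fused), dim=-1)
            + torch.softmax(self.head_skel(s.mean(dim=1)), dim=-1)
            + torch.softmax(self.head_rgb(r.mean(dim=1)), dim=-1)
        ) / 3.0
        return scores


# Example usage with random tensors standing in for extracted features.
model = CrossAttentionFusion()
skel_feats = torch.randn(2, 32, 256)
rgb_feats = torch.randn(2, 32, 256)
print(model(skel_feats, rgb_feats).shape)  # torch.Size([2, 60])
```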
Pages: 51930 - 51948
Number of pages: 19