Dual-stream cross-modality fusion transformer for RGB-D action recognition

Cited by: 22
Authors
Liu, Zhen [1 ,2 ]
Cheng, Jun [1 ]
Liu, Libo [1 ,2 ]
Ren, Ziliang [1 ,3 ]
Zhang, Qieshi [1 ]
Song, Chengqun [1 ]
Affiliations
[1] Chinese Acad Sci, Shenzhen Inst Adv Technol, Guangdong Prov Key Lab Robot & Intelligent Syst, Shenzhen 518055, Peoples R China
[2] Univ Chinese Acad Sci UCAS, Sch Artificial Intelligence, Beijing 100049, Peoples R China
[3] Dongguan Univ Technol, Sch Sci & Technol, Dongguan 523808, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Action recognition; Multimodal fusion; Transformer; ConvNets; NEURAL-NETWORKS;
DOI
10.1016/j.knosys.2022.109741
CLC Number
TP18 [Artificial intelligence theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
RGB-D-based action recognition can achieve accurate and robust performance due to rich complementary information, and thus has many application scenarios. However, existing works combine multiple modalities by late fusion or learn multimodal representation with simple feature-level fusion methods, which fail to effectively utilize complementary semantic information and model interactions between unimodal features. In this paper, we design a self-attention-based modal enhancement module (MEM) and a cross-attention-based modal interaction module (MIM) to enhance and fuse RGB and depth features. Moreover, a novel bottleneck excitation feed-forward block (BEF) is proposed to enhance the expression ability of the model with few extra parameters and computational overhead. By integrating these two modules with BEFs, one basic fusion layer of the cross-modality fusion transformer is obtained. We apply the transformer on top of the dual-stream convolutional neural networks (ConvNets) to build a dual-stream cross-modality fusion transformer (DSCMT) for RGB-D action recognition. Extensive experiments on the NTU RGB+D 120, PKU-MMD, and THU-READ datasets verify the effectiveness and superiority of the DSCMT. Furthermore, our DSCMT can still make considerable improvements when changing convolutional backbones or when applied to different multimodal combinations, indicating its universality and scalability. The code is available at https://github.com/liuzwin98/DSCMT. (c) 2022 Published by Elsevier B.V.
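To make the fusion layer described in the abstract concrete, below is a minimal PyTorch sketch of one cross-modality fusion layer built from the three components named there: a self-attention modal enhancement module (MEM), a cross-attention modal interaction module (MIM), and a bottleneck excitation feed-forward block (BEF). The exact wiring, dimensions, and the gated form of the BEF are illustrative assumptions, not the authors' code; the reference implementation is at https://github.com/liuzwin98/DSCMT.

```python
# Hedged sketch of one DSCMT-style fusion layer (assumed structure, not the authors' code).
import torch
import torch.nn as nn


class BottleneckExcitationFF(nn.Module):
    """Assumed form of the BEF: a bottleneck MLP whose sigmoid output gates the
    input channels, loosely following squeeze-and-excitation, plus a residual."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.GELU(),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return self.norm(x + x * self.gate(x))


class CrossModalityFusionLayer(nn.Module):
    """One fusion layer: per-modality self-attention (MEM), then RGB<->depth
    cross-attention (MIM), each stream followed by a BEF block."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.mem_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mem_depth = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mim_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mim_depth = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.bef_rgb = BottleneckExcitationFF(dim)
        self.bef_depth = BottleneckExcitationFF(dim)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_depth = nn.LayerNorm(dim)

    def forward(self, rgb, depth):
        # MEM: enhance each modality with self-attention (residual + norm).
        rgb = self.norm_rgb(rgb + self.mem_rgb(rgb, rgb, rgb)[0])
        depth = self.norm_depth(depth + self.mem_depth(depth, depth, depth)[0])
        # MIM: each modality queries the other modality's features.
        rgb_fused = rgb + self.mim_rgb(rgb, depth, depth)[0]
        depth_fused = depth + self.mim_depth(depth, rgb, rgb)[0]
        # BEF: bottleneck excitation feed-forward on each stream.
        return self.bef_rgb(rgb_fused), self.bef_depth(depth_fused)


if __name__ == "__main__":
    # Toy token sequences standing in for dual-stream ConvNet feature maps.
    rgb_tokens = torch.randn(2, 49, 256)
    depth_tokens = torch.randn(2, 49, 256)
    layer = CrossModalityFusionLayer(dim=256)
    out_rgb, out_depth = layer(rgb_tokens, depth_tokens)
    print(out_rgb.shape, out_depth.shape)  # torch.Size([2, 49, 256]) each
```

In the full model, several such layers would be stacked on top of the two ConvNet backbones and the fused streams pooled for classification; the stacking depth and pooling strategy here are left unspecified because the abstract does not state them.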
Pages: 11
Related Papers
50 records in total
  • [21] Child Action Recognition in RGB and RGB-D Data
    Turarova, Aizada
    Zhanatkyzy, Aida
    Telisheva, Zhansaule
    Sabyrov, Arman
    Sandygulova, Anara
    HRI'20: COMPANION OF THE 2020 ACM/IEEE INTERNATIONAL CONFERENCE ON HUMAN-ROBOT INTERACTION, 2020, : 491 - 492
  • [22] RGB-D Human Action Recognition of Deep Feature Enhancement and Fusion Using Two-Stream ConvNet
    Liu, Yun
    Ma, Ruidi
    Li, Hui
    Wang, Chuanxu
    Tao, Ye
    JOURNAL OF SENSORS, 2021, 2021
  • [23] Absolute Camera Pose Regression Using an RGB-D Dual-Stream Network and Handcrafted Base Poses
    Kao, Peng-Yuan
    Zhang, Rong-Rong
    Chen, Timothy
    Hung, Yi-Ping
    SENSORS, 2022, 22 (18)
  • [24] FCMNet: Frequency-aware cross-modality attention networks for RGB-D salient object detection
    Jin, Xiao
    Guo, Chunle
    He, Zhen
    Xu, Jing
    Wang, Yongwei
    Su, Yuting
    NEUROCOMPUTING, 2022, 491 : 414 - 425
  • [25] Transformer fusion for indoor RGB-D semantic segmentation
    Wu, Zongwei
    Zhou, Zhuyun
    Allibert, Guillaume
    Stolz, Christophe
    Demonceaux, Cedric
    Ma, Chao
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 249
  • [26] An efficient speech emotion recognition based on a dual-stream CNN-transformer fusion network
    Tellai, M.
    Gao, L.
    Mao, Q.
    International Journal of Speech Technology, 2023, 26 (02) : 541 - 557
  • [27] Infrared and 3D Skeleton Feature Fusion for RGB-D Action Recognition
    De Boissiere, Alban Main
    Noumeir, Rita
    IEEE ACCESS, 2020, 8 (08) : 168297 - 168308
  • [28] MULTI-MODAL FEATURE FUSION FOR ACTION RECOGNITION IN RGB-D SEQUENCES
    Shahroudy, Amir
    Wang, Gang
    Ng, Tian-Tsong
    2014 6TH INTERNATIONAL SYMPOSIUM ON COMMUNICATIONS, CONTROL AND SIGNAL PROCESSING (ISCCSP), 2014, : 73 - 76
  • [29] Evaluating fusion of RGB-D and inertial sensors for multimodal human action recognition
    Imran, Javed
    Raman, Balasubramanian
    Journal of Ambient Intelligence and Humanized Computing, 2020, 11 : 189 - 208
  • [30] ACTION RECOGNITION IN RGB-D EGOCENTRIC VIDEOS
    Tang, Yansong
    Tian, Yi
    Lu, Jiwen
    Feng, Jianjiang
    Zhou, Jie
    2017 24TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2017, : 3410 - 3414