Dual-stream cross-modality fusion transformer for RGB-D action recognition

Cited by: 22
Authors
Liu, Zhen [1 ,2 ]
Cheng, Jun [1 ]
Liu, Libo [1 ,2 ]
Ren, Ziliang [1 ,3 ]
Zhang, Qieshi [1 ]
Song, Chengqun [1 ]
Affiliations
[1] Chinese Acad Sci, Shenzhen Inst Adv Technol, Guangdong Prov Key Lab Robot & Intelligent Syst, Shenzhen 518055, Peoples R China
[2] Univ Chinese Acad Sci UCAS, Sch Artificial Intelligence, Beijing 100049, Peoples R China
[3] Dongguan Univ Technol, Sch Sci & Technol, Dongguan 523808, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Action recognition; Multimodal fusion; Transformer; ConvNets; NEURAL-NETWORKS;
DOI
10.1016/j.knosys.2022.109741
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
RGB-D-based action recognition can achieve accurate and robust performance thanks to the rich complementary information between modalities, and thus has many application scenarios. However, existing works either combine multiple modalities by late fusion or learn multimodal representations with simple feature-level fusion methods, which fail to effectively utilize complementary semantic information and to model interactions between unimodal features. In this paper, we design a self-attention-based modal enhancement module (MEM) and a cross-attention-based modal interaction module (MIM) to enhance and fuse RGB and depth features. Moreover, a novel bottleneck excitation feed-forward block (BEF) is proposed to enhance the expressive ability of the model with few extra parameters and little computational overhead. By integrating these two modules with BEFs, one basic fusion layer of the cross-modality fusion transformer is obtained. We apply the transformer on top of dual-stream convolutional neural networks (ConvNets) to build a dual-stream cross-modality fusion transformer (DSCMT) for RGB-D action recognition. Extensive experiments on the NTU RGB+D 120, PKU-MMD, and THU-READ datasets verify the effectiveness and superiority of the DSCMT. Furthermore, our DSCMT still yields considerable improvements when the convolutional backbones are changed or when it is applied to different multimodal combinations, indicating its universality and scalability. The code is available at https://github.com/liuzwin98/DSCMT. (c) 2022 Published by Elsevier B.V.
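The cross-attention-based modal interaction described in the abstract can be illustrated with a minimal NumPy sketch: tokens from one modality form the queries while tokens from the other modality supply the keys and values, and each stream is enhanced with the attended features of its counterpart before fusion. This is an illustrative simplification only; the function names, token counts, and dimensions below are invented for the example, and the paper's actual MIM uses learned projections, multi-head attention, and the BEF blocks described above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, context_feats):
    """Queries from one modality attend to tokens of the other modality."""
    d_k = query_feats.shape[-1]
    scores = query_feats @ context_feats.T / np.sqrt(d_k)  # (Nq, Nc)
    weights = softmax(scores, axis=-1)                     # rows sum to 1
    return weights @ context_feats                         # (Nq, d)

rng = np.random.default_rng(0)
rgb = rng.standard_normal((8, 64))    # 8 RGB tokens, 64-dim each (toy sizes)
depth = rng.standard_normal((8, 64))  # 8 depth tokens, 64-dim each

# Modal interaction: each stream is enhanced with cross-attended
# features from the other modality, then the two streams are fused.
rgb_enhanced = rgb + cross_attention(rgb, depth)
depth_enhanced = depth + cross_attention(depth, rgb)
fused = np.concatenate([rgb_enhanced, depth_enhanced], axis=-1)
print(fused.shape)  # (8, 128)
```

The residual connections (`rgb + ...`) mirror the standard transformer layer structure, so each modality retains its own features while gaining complementary context from the other stream.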
Pages: 11