Dual-stream cross-modality fusion transformer for RGB-D action recognition

Cited by: 22
Authors
Liu, Zhen [1, 2]
Cheng, Jun [1]
Liu, Libo [1, 2]
Ren, Ziliang [1, 3]
Zhang, Qieshi [1]
Song, Chengqun [1]
Affiliations
[1] Chinese Acad Sci, Shenzhen Inst Adv Technol, Guangdong Prov Key Lab Robot & Intelligent Syst, Shenzhen 518055, Peoples R China
[2] Univ Chinese Acad Sci UCAS, Sch Artificial Intelligence, Beijing 100049, Peoples R China
[3] Dongguan Univ Technol, Sch Sci & Technol, Dongguan 523808, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Action recognition; Multimodal fusion; Transformer; ConvNets; Neural networks
DOI
10.1016/j.knosys.2022.109741
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
RGB-D-based action recognition can achieve accurate and robust performance thanks to rich complementary information, and thus has many application scenarios. However, existing works either combine multiple modalities by late fusion or learn multimodal representations with simple feature-level fusion methods, which fail to effectively utilize complementary semantic information and to model interactions between unimodal features. In this paper, we design a self-attention-based modal enhancement module (MEM) and a cross-attention-based modal interaction module (MIM) to enhance and fuse RGB and depth features. Moreover, a novel bottleneck excitation feed-forward block (BEF) is proposed to enhance the expressive ability of the model with few extra parameters and little computational overhead. By integrating these two modules with BEFs, one basic fusion layer of the cross-modality fusion transformer is obtained. We apply the transformer on top of dual-stream convolutional neural networks (ConvNets) to build a dual-stream cross-modality fusion transformer (DSCMT) for RGB-D action recognition. Extensive experiments on the NTU RGB+D 120, PKU-MMD, and THU-READ datasets verify the effectiveness and superiority of the DSCMT. Furthermore, our DSCMT still yields considerable improvements when the convolutional backbones are changed or when it is applied to different multimodal combinations, indicating its universality and scalability. The code is available at https://github.com/liuzwin98/DSCMT. (c) 2022 Published by Elsevier B.V.
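To make the enhance-then-interact idea in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it assumes single-head scaled dot-product attention with no learned projections, simple residual connections, and no BEF block, and the names `attention`, `add_residual`, and `fusion_layer` are hypothetical. Self-attention within each modality stands in for the MEM, and cross-attention between modalities stands in for the MIM.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(dim)])
    return out


def add_residual(x, y):
    """Element-wise sum of two equally shaped token lists."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, y)]


def fusion_layer(rgb, depth):
    """One simplified fusion layer (assumption: no projections, no BEF)."""
    # Enhancement step: each modality attends to itself (self-attention).
    rgb_enh = add_residual(rgb, attention(rgb, rgb, rgb))
    depth_enh = add_residual(depth, attention(depth, depth, depth))
    # Interaction step: each modality queries the other (cross-attention).
    rgb_out = add_residual(rgb_enh, attention(rgb_enh, depth_enh, depth_enh))
    depth_out = add_residual(depth_enh, attention(depth_enh, rgb_enh, rgb_enh))
    return rgb_out, depth_out


random.seed(0)
# Hypothetical inputs: 4 RGB tokens and 6 depth tokens, each of dimension 8.
rgb_tokens = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
depth_tokens = [[random.gauss(0, 1) for _ in range(8)] for _ in range(6)]
rgb_fused, depth_fused = fusion_layer(rgb_tokens, depth_tokens)
print(len(rgb_fused), len(rgb_fused[0]), len(depth_fused), len(depth_fused[0]))
```

Each stream keeps its own token count after fusion, because the queries come from the stream being updated while the keys and values come from the other modality. The actual layer design, including projections and the BEF, is in the repository linked in the abstract.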
Pages: 11
Related papers
50 in total
  • [41] Recurrent Convolutional Fusion for RGB-D Object Recognition
    Loghmani, Mohammad Reza
    Planamente, Mirco
    Caputo, Barbara
    Vincze, Markus
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2019, 4 (03) : 2878 - 2885
  • [42] Facial Expression Recognition Through Cross-Modality Attention Fusion
    Ni, Rongrong
    Yang, Biao
    Zhou, Xu
    Cangelosi, Angelo
    Liu, Xiaofeng
    IEEE TRANSACTIONS ON COGNITIVE AND DEVELOPMENTAL SYSTEMS, 2023, 15 (01) : 175 - 185
  • [43] Dual-stream encoded fusion saliency detection based on RGB and grayscale images
    Tao Xu
    Weishuo Zhao
    Haojie Chai
    Lei Cai
    Multimedia Tools and Applications, 2023, 82 : 47327 - 47346
  • [44] A Video Action Recognition Method via Dual-Stream Feature Fusion Neural Network with Attention
    Han, Jianmin
    Li, Jie
    INTERNATIONAL JOURNAL OF UNCERTAINTY FUZZINESS AND KNOWLEDGE-BASED SYSTEMS, 2024, 32 (04) : 673 - 694
  • [45] Deep Bilinear Learning for RGB-D Action Recognition
    Hu, Jian-Fang
    Zheng, Wei-Shi
    Pan, Jiahui
    Lai, Jianhuang
    Zhang, Jianguo
    COMPUTER VISION - ECCV 2018, PT VII, 2018, 11211 : 346 - 362
  • [46] Joint Deep Learning for RGB-D Action Recognition
    Qin, Xiaolei
    Ge, Yongxin
    Zhan, Liuwei
    Li, Guangrui
    Huang, Sheng
    Wang, Hongxing
    Chen, Feiyu
    2018 IEEE INTERNATIONAL CONFERENCE ON VISUAL COMMUNICATIONS AND IMAGE PROCESSING (IEEE VCIP), 2018,
  • [47] RGB-D action recognition using linear coding
    Liu, Huaping
    Yuan, Mingyi
    Sun, Fuchun
    NEUROCOMPUTING, 2015, 149 : 79 - 85
  • [48] Viewpoint Invariant RGB-D Human Action Recognition
    Liu, Jian
    Akhtar, Naveed
    Mian, Ajmal
    2017 INTERNATIONAL CONFERENCE ON DIGITAL IMAGE COMPUTING - TECHNIQUES AND APPLICATIONS (DICTA), 2017, : 261 - 268
  • [49] MoADNet: Mobile Asymmetric Dual-Stream Networks for Real-Time and Lightweight RGB-D Salient Object Detection
    Jin, Xiao
    Yi, Kang
    Xu, Jing
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (11) : 7632 - 7645
  • [50] Modality and Component Aware Feature Fusion for RGB-D Scene Classification
    Wang, Anran
    Cai, Jianfei
    Lu, Jiwen
    Cham, Tat-Jen
    2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016, : 5995 - 6004