Dual-stream cross-modality fusion transformer for RGB-D action recognition

Cited by: 22
Authors
Liu, Zhen [1 ,2 ]
Cheng, Jun [1 ]
Liu, Libo [1 ,2 ]
Ren, Ziliang [1 ,3 ]
Zhang, Qieshi [1 ]
Song, Chengqun [1 ]
Affiliations
[1] Chinese Acad Sci, Shenzhen Inst Adv Technol, Guangdong Prov Key Lab Robot & Intelligent Syst, Shenzhen 518055, Peoples R China
[2] Univ Chinese Acad Sci UCAS, Sch Artificial Intelligence, Beijing 100049, Peoples R China
[3] Dongguan Univ Technol, Sch Sci & Technol, Dongguan 523808, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Action recognition; Multimodal fusion; Transformer; ConvNets; Neural networks;
DOI
10.1016/j.knosys.2022.109741
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
RGB-D-based action recognition can achieve accurate and robust performance due to rich complementary information, and thus has many application scenarios. However, existing works combine multiple modalities by late fusion or learn multimodal representations with simple feature-level fusion methods, which fail to effectively utilize complementary semantic information and to model interactions between unimodal features. In this paper, we design a self-attention-based modal enhancement module (MEM) and a cross-attention-based modal interaction module (MIM) to enhance and fuse RGB and depth features. Moreover, a novel bottleneck excitation feed-forward block (BEF) is proposed to enhance the expressive ability of the model with few extra parameters and little computational overhead. By integrating these two modules with BEFs, one basic fusion layer of the cross-modality fusion transformer is obtained. We apply the transformer on top of dual-stream convolutional neural networks (ConvNets) to build a dual-stream cross-modality fusion transformer (DSCMT) for RGB-D action recognition. Extensive experiments on the NTU RGB+D 120, PKU-MMD, and THU-READ datasets verify the effectiveness and superiority of the DSCMT. Furthermore, our DSCMT still makes considerable improvements when changing convolutional backbones or when applied to different multimodal combinations, indicating its universality and scalability. The code is available at https://github.com/liuzwin98/DSCMT. (c) 2022 Published by Elsevier B.V.
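The abstract describes a cross-attention-based modal interaction module in which features of one modality attend to the other. The following is a minimal illustrative sketch of single-head cross-attention in NumPy; it is not the authors' implementation, and all names, shapes, and the single-head simplification are assumptions for illustration only.

```python
import numpy as np

def cross_attention(query_feats, key_value_feats, d_k):
    """Single-head cross-attention: tokens of one modality attend to the other.

    query_feats:     (N, d_k) features of the querying modality (e.g. RGB)
    key_value_feats: (M, d_k) features of the attended modality (e.g. depth)
    Returns (N, d_k) fused features (hypothetical simplified form; real
    implementations add learned Q/K/V projections and multiple heads).
    """
    scores = query_feats @ key_value_feats.T / np.sqrt(d_k)  # (N, M) similarities
    scores -= scores.max(axis=1, keepdims=True)              # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)                  # row-wise softmax
    return attn @ key_value_feats                            # weighted sum of depth tokens

# Toy example: 4 RGB tokens attend to 6 depth tokens, 8-dim features.
rng = np.random.default_rng(0)
rgb = rng.standard_normal((4, 8))
depth = rng.standard_normal((6, 8))
fused = cross_attention(rgb, depth, d_k=8)
print(fused.shape)  # (4, 8)
```

In a symmetric dual-stream design, the same operation would also be applied in the opposite direction (depth queries attending to RGB keys/values), so that both streams are enriched with cross-modal context.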
Pages: 11
Related papers (50 records)
  • [31] Evaluating fusion of RGB-D and inertial sensors for multimodal human action recognition
    Imran, Javed
    Raman, Balasubramanian
    JOURNAL OF AMBIENT INTELLIGENCE AND HUMANIZED COMPUTING, 2020, 11 (01) : 189 - 208
  • [32] Structured Images for RGB-D Action Recognition
    Wang, Pichao
    Wang, Shuang
    Gao, Zhimin
    Hou, Yonghong
    Li, Wanqing
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2017), 2017, : 1005 - 1014
  • [33] Multi-Stream Deep Neural Networks for RGB-D Egocentric Action Recognition
    Tang, Yansong
    Wang, Zian
    Lu, Jiwen
    Feng, Jianjiang
    Zhou, Jie
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2019, 29 (10) : 3001 - 3015
  • [34] Speech Emotion Recognition Using Dual-Stream Representation and Cross-Attention Fusion
    Yu, Shaode
    Meng, Jiajian
    Fan, Wenqing
    Chen, Ye
    Zhu, Bing
    Yu, Hang
    Xie, Yaoqin
    Sun, Qiuirui
    ELECTRONICS, 2024, 13 (11)
  • [35] Audio-Visual Speech Recognition Based on Dual Cross-Modality Attentions with the Transformer Model
    Lee, Yong-Hyeok
    Jang, Dong-Won
    Kim, Jae-Bin
    Park, Rae-Hong
    Park, Hyung-Min
    APPLIED SCIENCES-BASEL, 2020, 10 (20): : 1 - 18
  • [36] Spatial-Temporal Information Aggregation and Cross-Modality Interactive Learning for RGB-D-Based Human Action Recognition
    Cheng, Qin
    Liu, Zhen
    Ren, Ziliang
    Cheng, Jun
    Liu, Jianming
    IEEE ACCESS, 2022, 10 : 104190 - 104201
  • [37] MSN: Modality separation networks for RGB-D scene recognition
    Xiong, Zhitong
    Yuan, Yuan
    Wang, Qi
    NEUROCOMPUTING, 2020, 373 : 81 - 89
  • [38] Dual-stream encoded fusion saliency detection based on RGB and grayscale images
    Xu, Tao
    Zhao, Weishuo
    Chai, Haojie
    Cai, Lei
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (30) : 47327 - 47346
  • [39] Periocular Recognition in the Wild: Implementation of RGB-OCLBCP Dual-Stream CNN
    Tiong, Leslie Ching Ow
    Lee, Yunli
    Teoh, Andrew Beng Jin
    APPLIED SCIENCES-BASEL, 2019, 9 (13):
  • [40] A Complementary Fusion Strategy for RGB-D Face Recognition
    Zheng, Haoyuan
    Wang, Weihang
    Wen, Fei
    Liu, Peilin
    MULTIMEDIA MODELING (MMM 2022), PT I, 2022, 13141 : 339 - 351