Dual-stream cross-modality fusion transformer for RGB-D action recognition

Cited: 22
Authors
Liu, Zhen [1 ,2 ]
Cheng, Jun [1 ]
Liu, Libo [1 ,2 ]
Ren, Ziliang [1 ,3 ]
Zhang, Qieshi [1 ]
Song, Chengqun [1 ]
Affiliations
[1] Chinese Acad Sci, Shenzhen Inst Adv Technol, Guangdong Prov Key Lab Robot & Intelligent Syst, Shenzhen 518055, Peoples R China
[2] Univ Chinese Acad Sci UCAS, Sch Artificial Intelligence, Beijing 100049, Peoples R China
[3] Dongguan Univ Technol, Sch Sci & Technol, Dongguan 523808, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Action recognition; Multimodal fusion; Transformer; ConvNets; Neural networks
DOI
10.1016/j.knosys.2022.109741
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
RGB-D-based action recognition can achieve accurate and robust performance thanks to the rich complementary information between the two modalities, and thus has many application scenarios. However, existing works either combine the modalities by late fusion or learn multimodal representations with simple feature-level fusion methods, which fail to effectively exploit complementary semantic information or model the interactions between unimodal features. In this paper, we design a self-attention-based modal enhancement module (MEM) and a cross-attention-based modal interaction module (MIM) to enhance and fuse RGB and depth features. Moreover, a novel bottleneck excitation feed-forward block (BEF) is proposed to enhance the expressive ability of the model with few extra parameters and little computational overhead. Integrating these two modules with BEFs yields one basic fusion layer of the cross-modality fusion transformer. We apply the transformer on top of dual-stream convolutional neural networks (ConvNets) to build a dual-stream cross-modality fusion transformer (DSCMT) for RGB-D action recognition. Extensive experiments on the NTU RGB+D 120, PKU-MMD, and THU-READ datasets verify the effectiveness and superiority of the DSCMT. Furthermore, the DSCMT still yields considerable improvements when the convolutional backbones are changed or when it is applied to different multimodal combinations, indicating its universality and scalability. The code is available at https://github.com/liuzwin98/DSCMT. (c) 2022 Published by Elsevier B.V.
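To make the fusion design concrete, below is a minimal PyTorch-style sketch of one fusion layer, reconstructed only from the abstract rather than from the authors' released code (see the repository link above). The module names MEM, MIM, and BEF follow the paper; everything else, including the layer sizes, the squeeze-and-excitation-style gate inside BEF, and the pre-norm residual placement, is an assumption made for illustration.

import torch
import torch.nn as nn

class BEF(nn.Module):
    """Bottleneck excitation feed-forward block: a low-rank MLP with a
    squeeze-and-excitation-style channel gate (layer sizes are guesses)."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        hidden = dim // reduction
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.gate = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, dim), nn.Sigmoid())

    def forward(self, x):                           # x: (B, N, dim) tokens
        g = self.gate(x.mean(dim=1, keepdim=True))  # excitation from token mean
        return x + self.mlp(x) * g                  # gated residual feed-forward

class FusionLayer(nn.Module):
    """One basic fusion layer: MEM (self-attention per modality), MIM
    (cross-attention between modalities), then a BEF on each stream."""
    def __init__(self, dim, heads=8):
        super().__init__()
        mha = lambda: nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mem_rgb, self.mem_d = mha(), mha()
        self.mim_rgb, self.mim_d = mha(), mha()
        self.bef_rgb, self.bef_d = BEF(dim), BEF(dim)
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(4))

    def forward(self, rgb, depth):                  # each: (B, N, dim)
        # MEM: enhance each modality with self-attention.
        r, d = self.norms[0](rgb), self.norms[1](depth)
        rgb = rgb + self.mem_rgb(r, r, r)[0]
        depth = depth + self.mem_d(d, d, d)[0]
        # MIM: each stream queries the other stream (cross-attention).
        r, d = self.norms[2](rgb), self.norms[3](depth)
        rgb = rgb + self.mim_rgb(r, d, d)[0]
        depth = depth + self.mim_d(d, r, r)[0]
        # BEF: bottleneck excitation feed-forward on each stream.
        return self.bef_rgb(rgb), self.bef_d(depth)

if __name__ == "__main__":
    # Token sequences pooled from the two ConvNet streams (shapes invented).
    rgb, depth = torch.randn(2, 49, 256), torch.randn(2, 49, 256)
    fused_rgb, fused_depth = FusionLayer(256)(rgb, depth)
    print(fused_rgb.shape, fused_depth.shape)       # (2, 49, 256) each

Stacking several such layers on top of the dual-stream ConvNets and pooling the fused tokens for classification would mirror the pipeline the abstract describes; the exact wiring in the official implementation may differ.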
Pages: 11
Related Papers
50 items in total
  • [11] Cross-modality Discrepant Interaction Network for RGB-D Salient Object Detection
    Zhang, Chen
    Cong, Runmin
    Lin, Qinwei
    Ma, Lin
    Li, Feng
    Zhao, Yao
    Kwong, Sam
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021 : 2094 - 2102
  • [12] Fusion of Skeleton and RGB Features for RGB-D Human Action Recognition
    Xu, Weiyao
    Wu, Muqing
    Zhao, Min
    Xia, Ting
    IEEE SENSORS JOURNAL, 2021, 21 (17) : 19157 - 19164
  • [13] Compressed Video Action Recognition With Dual-Stream and Dual-Modal Transformer
    Mou, Yuting
    Jiang, Xinghao
    Xu, Ke
    Sun, Tanfeng
    Wang, Zepeng
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (05) : 3299 - 3312
  • [14] Indoor RGB-D Image Semantic Segmentation Based on Dual-Stream Weighted Gabor Convolutional Network Fusion
    Wang, Xuchu
    Liu, Huihuang
    Niu, Yanmin
    ACTA OPTICA SINICA, 2020, 40 (19)
  • [15] Double cross-modality progressively guided network for RGB-D salient object detection
    Yao, Cuili
    Feng, Lin
    Kong, Yuqiu
    Li, Shengming
    Li, Hang
    IMAGE AND VISION COMPUTING, 2022, 117
  • [16] Multimodal Feature Fusion Model for RGB-D Action Recognition
    Xu, Weiyao
    Wu, Muqing
    Zhao, Min
    Xia, Ting
    2021 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO WORKSHOPS (ICMEW), 2021
  • [17] Trear: Transformer-Based RGB-D Egocentric Action Recognition
    Li, Xiangyu
    Hou, Yonghong
    Wang, Pichao
    Gao, Zhimin
    Xu, Mingliang
    Li, Wanqing
    IEEE TRANSACTIONS ON COGNITIVE AND DEVELOPMENTAL SYSTEMS, 2022, 14 (01) : 246 - 252
  • [18] Dual-Stream Fusion Network with ConvNeXtV2 for Pig Weight Estimation Using RGB-D Data in Aisles
    Tan, Zujie
    Liu, Junbin
    Xiao, Deqin
    Liu, Youfu
    Huang, Yigui
    ANIMALS, 2023, 13 (24)
  • [19] SwinTFNet: Dual-Stream Transformer With Cross Attention Fusion for Land Cover Classification
    Ren, Bo
    Liu, Bo
    Hou, Biao
    Wang, Zhao
    Yang, Chen
    Jiao, Licheng
    IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2024, 21 : 1 - 5
  • [20] CIR-Net: Cross-Modality Interaction and Refinement for RGB-D Salient Object Detection
    Cong, Runmin
    Lin, Qinwei
    Zhang, Chen
    Li, Chongyi
    Cao, Xiaochun
    Huang, Qingming
    Zhao, Yao
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31 : 6800 - 6815