Dual-stream cross-modality fusion transformer for RGB-D action recognition

Cited by: 22

Authors:

Liu, Zhen [1,2]

Cheng, Jun [1]

Liu, Libo [1,2]

Ren, Ziliang [1,3]

Zhang, Qieshi [1]

Song, Chengqun [1]

Affiliations:

[1] Chinese Acad Sci, Shenzhen Inst Adv Technol, Guangdong Prov Key Lab Robot & Intelligent Syst, Shenzhen 518055, Peoples R China

[2] Univ Chinese Acad Sci UCAS, Sch Artificial Intelligence, Beijing 100049, Peoples R China

[3] Dongguan Univ Technol, Sch Sci & Technol, Dongguan 523808, Peoples R China

Source:

KNOWLEDGE-BASED SYSTEMS | 2022 / Vol. 255

Funding:

National Natural Science Foundation of China;

Keywords:

Action recognition; Multimodal fusion; Transformer; ConvNets; NEURAL-NETWORKS;

DOI:

10.1016/j.knosys.2022.109741

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

RGB-D-based action recognition can achieve accurate and robust performance due to rich complementary information, and thus has many application scenarios. However, existing works combine multiple modalities by late fusion or learn multimodal representation with simple feature-level fusion methods, which fail to effectively utilize complementary semantic information and model interactions between unimodal features. In this paper, we design a self-attention-based modal enhancement module (MEM) and a cross-attention-based modal interaction module (MIM) to enhance and fuse RGB and depth features. Moreover, a novel bottleneck excitation feed-forward block (BEF) is proposed to enhance the expression ability of the model with few extra parameters and computational overhead. By integrating these two modules with BEFs, one basic fusion layer of the cross-modality fusion transformer is obtained. We apply the transformer on top of the dual-stream convolutional neural networks (ConvNets) to build a dual-stream cross-modality fusion transformer (DSCMT) for RGB-D action recognition. Extensive experiments on the NTU RGB+D 120, PKU-MMD, and THU-READ datasets verify the effectiveness and superiority of the DSCMT. Furthermore, our DSCMT can still make considerable improvements when changing convolutional backbones or when applied to different multimodal combinations, indicating its universality and scalability. The code is available at https://github.com/liuzwin98/DSCMT. (c) 2022 Published by Elsevier B.V.
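As a rough illustration of the enhance-then-interact idea in the abstract, the sketch below applies self-attention within each modality and then cross-attention between them, using only the Python standard library. It is not the authors' implementation: the function names (`attention`, `fusion_layer`) are hypothetical, there are no learned projections, no multi-head splitting, no normalization, and the BEF block is omitted entirely. The released repository is the authoritative reference.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


def add(a, b):
    """Element-wise residual sum of two token lists of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def fusion_layer(rgb, depth):
    """Assumed simplification of one fusion layer (no learned weights, no BEF)."""
    # Enhancement step: each modality attends to its own tokens.
    rgb = add(rgb, attention(rgb, rgb, rgb))
    depth = add(depth, attention(depth, depth, depth))
    # Interaction step: queries come from one modality, keys and values from the other.
    rgb_fused = add(rgb, attention(rgb, depth, depth))
    depth_fused = add(depth, attention(depth, rgb, rgb))
    return rgb_fused, depth_fused


# Toy tokens: 3 RGB tokens and 2 depth tokens, each of dimension 2.
rgb_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
depth_tokens = [[0.5, 0.5], [2.0, -1.0]]
rgb_out, depth_out = fusion_layer(rgb_tokens, depth_tokens)
```

Each output keeps the token count of its own modality, so the two streams can have different sequence lengths as long as the feature dimension matches.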

Pages: 11
