Cross-Modal Transformer for RGB-D semantic segmentation of production workshop objects

Cited by: 5
Authors
Ru, Qingjun [1 ]
Chen, Guangzhu [1 ]
Zuo, Tingyu [1 ]
Liao, Xiaojuan [1 ]
Affiliations
[1] Chengdu Univ Technol, Coll Comp Sci & Cyber Secur, Chengdu, Peoples R China
Keywords
Cross-Modal; Production workshop object; RGB-D; Semantic segmentation; Transformer;
DOI
10.1016/j.patcog.2023.109862
CLC number
TP18 [Artificial intelligence theory];
Discipline codes
081104; 0812; 0835; 1405
Abstract
Scene understanding in a production workshop is an important technology for improving the workshop's intelligence level, and semantic segmentation of production workshop objects is an effective method for realizing scene understanding. Given the variety of information in a production workshop, making full use of the complementary information in RGB and depth images can effectively improve the semantic segmentation accuracy of production workshop objects. To address the multi-scale and real-time problems of segmenting production workshop objects, this paper proposes the Cross-Modal Transformer (CMFormer), a Transformer-based cross-modal semantic segmentation model. Its key feature-correction and feature-fusion parts consist of the Multi-Scale Channel Attention Correction (MS-CAC) module and the Global Feature Aggregation (GFA) module. By improving Multi-Head Self-Attention (MHSA) in the Transformer, we design Cross-Modal Multi-Head Self-Attention (CM-MHSA) to build long-range interaction between the RGB image and the depth image, and on the basis of CM-MHSA we further design the MS-CAC and GFA modules to achieve cross-modal information interaction in the channel and spatial dimensions. The MS-CAC module enriches the multi-scale features of each channel and achieves more accurate channel attention correction between the two modalities; the GFA module lets the RGB and depth features interact in the spatial dimension while fusing global and local features. In experiments on the NYU Depth v2 dataset, CMFormer reaches 68.00% mean pixel accuracy (MPA) and 55.75% mean intersection over union (mIoU), achieving state-of-the-art results. In experiments on the Scene Objects for Production workshop (SOP) dataset, CMFormer achieves 96.74% MPA, 92.98% mIoU, and 43 frames per second (FPS), showing high precision and good real-time performance. Code is available at: https://github.com/FutureIAI/CMFormer
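The abstract describes CM-MHSA as attention that builds long-range interaction between RGB and depth features. As a minimal illustrative sketch only (not the authors' CMFormer implementation, which is available at the GitHub link above), the PyTorch snippet below shows one common way such cross-modal attention is realized: each modality's queries attend to the other modality's keys and values. The class name, dimensions, and the use of nn.MultiheadAttention are assumptions for illustration.

```python
# Illustrative sketch of cross-modal multi-head attention, assuming
# queries from one modality attend to keys/values of the other.
# Not the authors' CMFormer code (see their GitHub repository).
import torch
import torch.nn as nn


class CrossModalMHSA(nn.Module):
    """Hypothetical cross-modal attention block: RGB queries attend to
    depth keys/values and vice versa; all sizes are assumptions."""

    def __init__(self, dim: int = 64, num_heads: int = 4):
        super().__init__()
        # nn.MultiheadAttention accepts distinct query/key/value inputs,
        # which is all that cross-modal interaction requires.
        self.rgb_from_depth = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.depth_from_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor):
        # rgb, depth: (batch, tokens, dim) flattened feature maps
        rgb_out, _ = self.rgb_from_depth(query=rgb, key=depth, value=depth)
        depth_out, _ = self.depth_from_rgb(query=depth, key=rgb, value=rgb)
        # Residual connections preserve each modality's own features.
        return rgb + rgb_out, depth + depth_out


if __name__ == "__main__":
    x_rgb = torch.randn(2, 196, 64)    # e.g. 14x14 RGB feature tokens
    x_depth = torch.randn(2, 196, 64)  # matching depth feature tokens
    r, d = CrossModalMHSA()(x_rgb, x_depth)
    print(r.shape, d.shape)  # torch.Size([2, 196, 64]) twice
```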
Pages: 13
Related papers
50 items total
  • [1] Cross-modal attention fusion network for RGB-D semantic segmentation
    Zhao, Qiankun
    Wan, Yingcai
    Xu, Jiqian
    Fang, Lijin
    NEUROCOMPUTING, 2023, 548
  • [2] UCTNet: Uncertainty-Aware Cross-Modal Transformer Network for Indoor RGB-D Semantic Segmentation
    Ying, Xiaowen
    Chuah, Mooi Choo
    COMPUTER VISION - ECCV 2022, PT XXX, 2022, 13690 : 20 - 37
  • [3] A Cross-Modal Feature Fusion Model Based on ConvNeXt for RGB-D Semantic Segmentation
    Tang, Xiaojiang
    Li, Baoxia
    Guo, Junwei
    Chen, Wenzhuo
    Zhang, Dan
    Huang, Feng
    MATHEMATICS, 2023, 11 (08)
  • [4] Lightweight cross-modal transformer for RGB-D salient object detection
    Huang, Nianchang
    Yang, Yang
    Zhang, Qiang
    Han, Jungong
    Huang, Jin
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 249
  • [5] CMPFFNet: Cross-Modal and Progressive Feature Fusion Network for RGB-D Indoor Scene Semantic Segmentation
    Zhou, Wujie
    Xiao, Yuxiang
    Yan, Weiqing
    Yu, Lu
    IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING, 2024, 21 (04) : 5523 - 5533
  • [6] Cross-Modal Adaptation for RGB-D Detection
    Hoffman, Judy
    Gupta, Saurabh
    Leong, Jian
    Guadarrama, Sergio
    Darrell, Trevor
    2016 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), 2016, : 5032 - 5039
  • [7] Disentangled Cross-Modal Transformer for RGB-D Salient Object Detection and Beyond
    Chen, Hao
    Shen, Feihong
    Ding, Ding
    Deng, Yongjian
    Li, Chao
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 1699 - 1709
  • [8] Transformer fusion for indoor RGB-D semantic segmentation
    Wu, Zongwei
    Zhou, Zhuyun
    Allibert, Guillaume
    Stolz, Christophe
    Demonceaux, Cedric
    Ma, Chao
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 249
  • [9] CLGFormer: Cross-Level-Guided transformer for RGB-D semantic segmentation
    Li, T.
    Zhou, Q.
    Wu, D.
    Sun, M.
    Hu, T.
    MULTIMEDIA TOOLS AND APPLICATIONS, 2025, 84 (11) : 9447 - 9469