Cross-Modal Transformer for RGB-D semantic segmentation of production workshop objects

Cited: 5
|
Authors
Ru, Qingjun [1 ]
Chen, Guangzhu [1 ]
Zuo, Tingyu [1 ]
Liao, Xiaojuan [1 ]
Affiliations
[1] Chengdu Univ Technol, Coll Comp Sci & Cyber Secur, Chengdu, Peoples R China
Keywords
Cross-Modal; Production workshop object; RGB-D; Semantic segmentation; Transformer;
DOI
10.1016/j.patcog.2023.109862
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Scene understanding in a production workshop is an important technology for improving its level of intelligence, and semantic segmentation of production workshop objects is an effective method for realizing scene understanding. Given the variety of information in a production workshop, making full use of the complementary information in RGB and depth images can effectively improve the semantic segmentation accuracy of production workshop objects. To address the multi-scale and real-time challenges of segmenting production workshop objects, this paper proposes the Cross-Modal Transformer (CMFormer), a Transformer-based cross-modal semantic segmentation model. Its key feature-correction and feature-fusion components are the Multi-Scale Channel Attention Correction (MS-CAC) module and the Global Feature Aggregation (GFA) module. By improving the Multi-Head Self-Attention (MHSA) in the Transformer, we design Cross-Modal Multi-Head Self-Attention (CM-MHSA) to build long-range interaction between the RGB image and the depth image, and we further design the MS-CAC and GFA modules on the basis of CM-MHSA to achieve cross-modal information interaction in the channel and spatial dimensions. The MS-CAC module enriches the multi-scale features of each channel and achieves more accurate channel attention correction between the two modalities; the GFA module lets RGB and depth features interact in the spatial dimension while fusing global and local features at the same time. In experiments on the NYU Depth v2 dataset, CMFormer reaches 68.00% MPA (Mean Pixel Accuracy) and 55.75% mIoU (Mean Intersection over Union), achieving state-of-the-art results. In experiments on the Scene Objects for Production workshop (SOP) dataset, CMFormer achieves 96.74% MPA, 92.98% mIoU and 43 FPS (Frames Per Second), showing high precision and good real-time performance. Code is available at: https://github.com/FutureIAI/CMFormer
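The core idea behind cross-modal attention, as the abstract describes it, is to let one modality's features attend to the other's so that long-range RGB-depth interactions are captured. The following is only a minimal single-head sketch of that idea in NumPy (queries from one modality, keys/values from the other); the paper's actual CM-MHSA design, including its multi-head form and the MS-CAC/GFA modules built on it, is in the linked repository, and the function and variable names here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(q_feat, kv_feat):
    """Tokens of one modality (queries) attend to tokens of the other
    modality (keys/values) -- a single-head, unprojected sketch."""
    d_k = q_feat.shape[-1]
    scores = q_feat @ kv_feat.T / np.sqrt(d_k)   # (Nq, Nkv) affinity matrix
    return softmax(scores, axis=-1) @ kv_feat    # cross-modal aggregation

rng = np.random.default_rng(0)
rgb = rng.standard_normal((4, 8))     # 4 RGB tokens, feature dim 8
depth = rng.standard_normal((4, 8))   # 4 depth tokens, feature dim 8
fused = cross_modal_attention(rgb, depth)  # RGB queries attend to depth
```

In a full multi-head version, learned projections would map each modality to per-head query/key/value spaces before this attention step, and the symmetric direction (depth attending to RGB) would run in parallel.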
Pages: 13
Related Papers
50 records in total
  • [41] RGB-D Salient Object Detection Based on Cross-Modal and Cross-Level Feature Fusion
    Peng, Yanbin
    Zhai, Zhinian
    Feng, Mingkun
    IEEE ACCESS, 2024, 12 : 45134 - 45146
  • [42] Global Guided Cross-Modal Cross-Scale Network for RGB-D Salient Object Detection
    Wang, Shuaihui
    Jiang, Fengyi
    Xu, Boqian
    SENSORS, 2023, 23 (16)
  • [44] DEPTH REMOVAL DISTILLATION FOR RGB-D SEMANTIC SEGMENTATION
    Fang, Tiyu
    Liang, Zhen
    Shao, Xiuli
    Dong, Zihao
    Li, Jinping
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 2405 - 2409
  • [45] RGB-D Saliency Detection based on Cross-Modal and Multi-scale Feature Fusion
    Zhu, Xuxing
    Wu, Jin
    Zhu, Lei
    2022 34TH CHINESE CONTROL AND DECISION CONFERENCE, CCDC, 2022, : 6154 - 6160
  • [46] Cross-modal refined adjacent-guided network for RGB-D salient object detection
    Bi H.
    Zhang J.
    Wu R.
    Tong Y.
    Jin W.
MULTIMEDIA TOOLS AND APPLICATIONS: 37453 - 37478
  • [47] RGB-D Salient Object Detection Based on Cross-modal Interactive Fusion and Global Awareness
    Sun F.-M.
    Hu X.-H.
    Wu J.-Y.
    Sun J.
    Wang F.-S.
    Ruan Jian Xue Bao/Journal of Software, 2024, 35 (04): : 1899 - 1913
  • [48] BCINet: Bilateral cross-modal interaction network for indoor scene understanding in RGB-D images
    Zhou, Wujie
    Yue, Yuchun
    Fang, Meixin
    Qian, Xiaohong
    Yang, Rongwang
    Yu, Lu
    INFORMATION FUSION, 2023, 94 : 32 - 42
  • [49] Multi-level cross-modal interaction network for RGB-D salient object detection
    Huang, Zhou
    Chen, Huai-Xin
    Zhou, Tao
    Yang, Yun-Zhi
    Liu, Bi-Yuan
    NEUROCOMPUTING, 2021, 452 : 200 - 211
  • [50] Intermediary-Generated Bridge Network for RGB-D Cross-Modal Re-Identification
    Wu, Jingjing
    Hong, Richang
    Tang, Shengeng
    ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY, 2024, 15 (06)