Cross-Modal Transformer for RGB-D semantic segmentation of production workshop objects

Cited by: 5
Authors
Ru, Qingjun [1 ]
Chen, Guangzhu [1 ]
Zuo, Tingyu [1 ]
Liao, Xiaojuan [1 ]
Affiliations
[1] Chengdu Univ Technol, Coll Comp Sci & Cyber Secur, Chengdu, Peoples R China
Keywords
Cross-Modal; Production workshop object; RGB-D; Semantic segmentation; Transformer;
DOI
10.1016/j.patcog.2023.109862
CLC number
TP18 [Artificial intelligence theory];
Discipline codes
081104; 0812; 0835; 1405
Abstract
Scene understanding in a production workshop is an important technology for improving the workshop's intelligence level, and semantic segmentation of production workshop objects is an effective method for realizing scene understanding. Given the variety of information in a production workshop, making full use of the complementary information in RGB and depth images can effectively improve the semantic segmentation accuracy of production workshop objects. To address the multi-scale and real-time problems of segmenting production workshop objects, this paper proposes the Cross-Modal Transformer (CMFormer), a Transformer-based cross-modal semantic segmentation model. Its key feature-correction and feature-fusion parts consist of the Multi-Scale Channel Attention Correction (MS-CAC) module and the Global Feature Aggregation (GFA) module. By improving Multi-Head Self-Attention (MHSA) in the Transformer, we design Cross-Modal Multi-Head Self-Attention (CM-MHSA) to build long-range interaction between the RGB image and the depth image, and on the basis of CM-MHSA we further design the MS-CAC and GFA modules to achieve cross-modal information interaction in the channel and spatial dimensions. The MS-CAC module enriches the multi-scale features of each channel and achieves more accurate channel attention correction between the two modalities; the GFA module lets the RGB and depth features interact in the spatial dimension while fusing global and local features. In experiments on the NYU Depth v2 dataset, CMFormer reaches 68.00% mean pixel accuracy (MPA) and 55.75% mean intersection over union (mIoU), achieving state-of-the-art results. In experiments on the Scene Objects for Production workshop (SOP) dataset, CMFormer achieves 96.74% MPA, 92.98% mIoU, and 43 frames per second (FPS), showing high precision and good real-time performance. Code is available at: https://github.com/FutureIAI/CMFormer
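The abstract describes CM-MHSA as attention that builds long-range interaction between RGB and depth features. As a minimal illustrative sketch only (not the authors' CMFormer implementation, which is available at the GitHub link above), the PyTorch snippet below shows one common way such cross-modal attention is realized: each modality's queries attend to the other modality's keys and values. The class name, dimensions, and the use of nn.MultiheadAttention are assumptions for illustration.

```python
# Illustrative sketch of cross-modal multi-head attention, assuming
# queries from one modality attend to keys/values of the other.
# Not the authors' CMFormer code (see their GitHub repository).
import torch
import torch.nn as nn


class CrossModalMHSA(nn.Module):
    """Hypothetical cross-modal attention block: RGB queries attend to
    depth keys/values and vice versa; all sizes are assumptions."""

    def __init__(self, dim: int = 64, num_heads: int = 4):
        super().__init__()
        # nn.MultiheadAttention accepts distinct query/key/value inputs,
        # which is all that cross-modal interaction requires.
        self.rgb_from_depth = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.depth_from_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor):
        # rgb, depth: (batch, tokens, dim) flattened feature maps
        rgb_out, _ = self.rgb_from_depth(query=rgb, key=depth, value=depth)
        depth_out, _ = self.depth_from_rgb(query=depth, key=rgb, value=rgb)
        # Residual connections preserve each modality's own features.
        return rgb + rgb_out, depth + depth_out


if __name__ == "__main__":
    x_rgb = torch.randn(2, 196, 64)    # e.g. 14x14 RGB feature tokens
    x_depth = torch.randn(2, 196, 64)  # matching depth feature tokens
    r, d = CrossModalMHSA()(x_rgb, x_depth)
    print(r.shape, d.shape)  # torch.Size([2, 196, 64]) twice
```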
Pages: 13
Related papers
50 items total
  • [1] Cross-modal attention fusion network for RGB-D semantic segmentation
    Zhao, Qiankun
    Wan, Yingcai
    Xu, Jiqian
    Fang, Lijin
    NEUROCOMPUTING, 2023, 548
  • [2] UCTNet: Uncertainty-Aware Cross-Modal Transformer Network for Indoor RGB-D Semantic Segmentation
    Ying, Xiaowen
    Chuah, Mooi Choo
    COMPUTER VISION - ECCV 2022, PT XXX, 2022, 13690 : 20 - 37
  • [3] A Cross-Modal Feature Fusion Model Based on ConvNeXt for RGB-D Semantic Segmentation
    Tang, Xiaojiang
    Li, Baoxia
    Guo, Junwei
    Chen, Wenzhuo
    Zhang, Dan
    Huang, Feng
    MATHEMATICS, 2023, 11 (08)
  • [4] Lightweight cross-modal transformer for RGB-D salient object detection
    Huang, Nianchang
    Yang, Yang
    Zhang, Qiang
    Han, Jungong
    Huang, Jin
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 249
  • [5] CMPFFNet: Cross-Modal and Progressive Feature Fusion Network for RGB-D Indoor Scene Semantic Segmentation
    Zhou, Wujie
    Xiao, Yuxiang
    Yan, Weiqing
    Yu, Lu
    IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING, 2024, 21 (04) : 5523 - 5533
  • [6] Cross-Modal Adaptation for RGB-D Detection
    Hoffman, Judy
    Gupta, Saurabh
    Leong, Jian
    Guadarrama, Sergio
    Darrell, Trevor
    2016 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), 2016, : 5032 - 5039
  • [7] Disentangled Cross-Modal Transformer for RGB-D Salient Object Detection and Beyond
    Chen, Hao
    Shen, Feihong
    Ding, Ding
    Deng, Yongjian
    Li, Chao
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 1699 - 1709
  • [8] Transformer fusion for indoor RGB-D semantic segmentation
    Wu, Zongwei
    Zhou, Zhuyun
    Allibert, Guillaume
    Stolz, Christophe
    Demonceaux, Cedric
    Ma, Chao
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 249
  • [9] CLGFormer: Cross-Level-Guided transformer for RGB-D semantic segmentation
    Li, T.
    Zhou, Q.
    Wu, D.
    Sun, M.
    Hu, T.
    MULTIMEDIA TOOLS AND APPLICATIONS, 2025, 84 (11) : 9447 - 9469