Cross-Modal Transformer for RGB-D semantic segmentation of production workshop objects

Cited by: 5
Authors
Ru, Qingjun [1 ]
Chen, Guangzhu [1 ]
Zuo, Tingyu [1 ]
Liao, Xiaojuan [1 ]
Affiliations
[1] Chengdu Univ Technol, Coll Comp Sci & Cyber Secur, Chengdu, Peoples R China
Keywords
Cross-Modal; Production workshop object; RGB-D; Semantic segmentation; Transformer;
DOI
10.1016/j.patcog.2023.109862
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Scene understanding in a production workshop is an important technology for improving the workshop's intelligence level, and semantic segmentation of production workshop objects is an effective method for realizing scene understanding. Because a production workshop contains varied kinds of information, making full use of the complementary information in RGB and depth images can effectively improve the semantic segmentation accuracy of production workshop objects. To address the multi-scale and real-time problems of segmenting production workshop objects, this paper proposes the Cross-Modal Transformer (CMFormer), a Transformer-based cross-modal semantic segmentation model. Its key feature-correction and feature-fusion parts are composed of the Multi-Scale Channel Attention Correction (MS-CAC) module and the Global Feature Aggregation (GFA) module. By improving Multi-Head Self-Attention (MHSA) in the Transformer, we design Cross-Modal Multi-Head Self-Attention (CM-MHSA) to build long-range interaction between the RGB image and the depth image, and we further design the MS-CAC and GFA modules on the basis of CM-MHSA to achieve cross-modal information interaction in the channel and spatial dimensions. The MS-CAC module enriches the multi-scale features of each channel and achieves more accurate channel attention correction between the two modalities; the GFA module lets the RGB and depth features interact in the spatial dimension while fusing global and local features. In experiments on the NYU Depth v2 dataset, CMFormer reaches 68.00% mean pixel accuracy (MPA) and 55.75% mean intersection over union (mIoU), achieving state-of-the-art results. In experiments on the Scene Objects for Production workshop (SOP) dataset, CMFormer achieves 96.74% MPA, 92.98% mIoU, and 43 FPS (frames per second), showing both high precision and good real-time performance. Code is available at: https://github.com/FutureIAI/CMFormer
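The abstract does not spell out the internals of CM-MHSA, so the following is a minimal PyTorch sketch of one plausible reading: one modality's tokens form the queries while the other modality supplies the keys and values, producing the long-range RGB-depth interaction described above. All names here (CrossModalMHSA, q_proj, kv_proj) are hypothetical illustrations, not the authors' code; the reference implementation is at https://github.com/FutureIAI/CMFormer.

# A minimal sketch of cross-modal multi-head self-attention, assuming
# queries come from one modality and keys/values from the other. This is
# an illustrative reading of the abstract, not the authors' implementation.
import torch
import torch.nn as nn

class CrossModalMHSA(nn.Module):
    """Hypothetical CM-MHSA: tokens of modality A (e.g. RGB) attend over
    tokens of modality B (e.g. depth); swap the inputs for the reverse."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q_proj = nn.Linear(dim, dim)        # queries from modality A
        self.kv_proj = nn.Linear(dim, 2 * dim)   # keys/values from modality B
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        # x_a, x_b: (batch, tokens, dim) flattened feature maps of each modality
        B, N, C = x_a.shape
        H = self.num_heads
        q = self.q_proj(x_a).view(B, N, H, C // H).transpose(1, 2)  # (B, H, N, C/H)
        k, v = self.kv_proj(x_b).view(B, -1, 2, H, C // H).permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) * self.scale               # (B, H, N, N_b)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)           # merge heads
        return self.out_proj(out)

if __name__ == "__main__":
    rgb = torch.randn(2, 196, 64)    # e.g. 14x14 RGB tokens, 64 channels
    depth = torch.randn(2, 196, 64)  # matching depth tokens
    cmmhsa = CrossModalMHSA(dim=64, num_heads=8)
    rgb_corrected = cmmhsa(rgb, depth)  # RGB features corrected by depth context
    print(rgb_corrected.shape)          # torch.Size([2, 196, 64])

In this reading, the per-channel correction of MS-CAC and the spatial fusion of GFA would be built on top of such a cross-attention primitive, applying it along the channel and spatial dimensions respectively.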
Pages: 13