CMT-6D: a lightweight iterative 6DoF pose estimation network based on cross-modal Transformer

Cited by: 0
Authors
Liu, Suyi [1]
Xu, Fang [2]
Wu, Chengdong [1]
Chi, Jianning [1]
Yu, Xiaosheng [1]
Wei, Longxing [3]
Leng, Chuanjiang [1]
Affiliations
[1] Northeastern Univ, Fac Robot Sci & Engn, Chuangxin Rd, Shenyang 110167, Liaoning, Peoples R China
[2] Acad Sinica, Shenyang Siasun Robot Automat Co Ltd, Quanyun Rd, Shenyang 110180, Liaoning, Peoples R China
[3] China Aerosp Sci & Ind Corp, Inst 706, Acad 2, Yongding Rd, Beijing 100049, Peoples R China
Source
VISUAL COMPUTER | 2025 / Vol. 41 / Issue 03
Funding
National Natural Science Foundation of China;
Keywords
6D pose estimation; Cross-modal Transformer; Cross-modal key query strategy; 3D keypoint selection; Lightweight pose iterative; 3D OBJECT DETECTION;
DOI
10.1007/s00371-024-03520-1
Chinese Library Classification (CLC) number
TP31 [Computer Software];
Discipline classification code
081202; 0835;
Abstract
6DoF pose estimation has received much attention in recent years. A key challenge is estimating object pose when the target has weak texture. In this work, we present the cross-modal Transformer (CMT-6D), a Transformer-based network for highly accurate workpiece-level 6D object pose estimation from a single RGB-D image. Our main insight is to let the surface texture information of RGB images and the geometric feature information of point clouds complement each other through a cross-modal Transformer, enabling accurate pose estimation for weakly textured targets. Specifically, the framework consists of two parallel Transformer branches, named Point Transformer and Image Transformer. Both branches use a pyramid-structured encoder and a multi-layer perceptron decoder to extract geometric features of point clouds and texture features of RGB images, respectively. A cross-modal key query strategy is then proposed for information exchange between the two parallel branches. In addition, at the output representation stage, we design a simple and effective 3D keypoint selection algorithm to address the problem that keypoints are likely to fall in non-salient regions. Finally, to improve pose estimation accuracy while meeting real-time requirements, a lightweight iterative pose network based on target feature regression is proposed to correct errors in the initial pose estimate. Extensive experiments on the LineMOD, Occlusion LineMOD, T-LESS, and YCB-Video datasets demonstrate the effectiveness and superiority of our method, and comparisons with state-of-the-art methods show that it improves 6D pose estimation performance. Ablation studies and visualizations validate the design of CMT-6D.
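To illustrate the kind of information exchange the abstract describes, the following is a minimal sketch of generic scaled dot-product cross-attention, in which point-cloud features form the queries and image features supply the keys and values. It is not the authors' cross-modal key query strategy, whose details the abstract does not give. The function name `cross_attention`, the feature sizes, the random projection matrices, and the residual connection are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Generic cross-attention (hypothetical helper, not from the paper).

    Queries come from one modality, keys and values from the other,
    so each query token gathers information from the other modality.
    """
    q = query_feats @ w_q            # (n_query, dim)
    k = context_feats @ w_k          # (n_context, dim)
    v = context_feats @ w_v          # (n_context, dim)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights


# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
n_points, n_patches, dim = 8, 6, 16
point_feats = rng.normal(size=(n_points, dim))   # stand-in geometric features
image_feats = rng.normal(size=(n_patches, dim))  # stand-in texture features

# Random projections stand in for learned weights.
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

# Point tokens query the image tokens; a residual connection (an assumption)
# keeps the original geometric features alongside the attended texture features.
attended, attn = cross_attention(point_feats, image_feats, w_q, w_k, w_v)
fused_points = point_feats + attended
```

Swapping the roles of `point_feats` and `image_feats` gives the opposite direction, in which image tokens query the point cloud. A two-branch design would presumably exchange information both ways, but the abstract does not say how.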
Pages: 2011-2027
Page count: 17