CMT-6D: a lightweight iterative 6DoF pose estimation network based on cross-modal Transformer

Cited by: 0
Authors
Liu, Suyi [1]
Xu, Fang [2]
Wu, Chengdong [1]
Chi, Jianning [1]
Yu, Xiaosheng [1]
Wei, Longxing [3]
Leng, Chuanjiang [1]
Affiliations
[1] Northeastern Univ, Fac Robot Sci & Engn, Chuangxin Rd, Shenyang 110167, Liaoning, Peoples R China
[2] Acad Sinica, Shenyang Siasun Robot Automat Co Ltd, Quanyun Rd, Shenyang 110180, Liaoning, Peoples R China
[3] China Aerosp Sci & Ind Corp, Inst 706, Acad 2, Yongding Rd, Beijing 100049, Peoples R China
Source
VISUAL COMPUTER | 2025 / Volume 41 / Issue 03
Funding
National Natural Science Foundation of China
Keywords
6D pose estimation; Cross-modal Transformer; Cross-modal key query strategy; 3D keypoint selection; Lightweight pose iterative; 3D OBJECT DETECTION
DOI
10.1007/s00371-024-03520-1
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
6DoF pose estimation has received much attention in recent years. A key challenge is estimating object pose when the target has weak texture. In this work, we present the cross-modal Transformer (CMT-6D), a Transformer-based network for highly accurate workpiece-level 6D object pose estimation from a single RGBD image. Our main insight is to let the surface texture information of RGB images and the geometric feature information of point clouds complement each other through a cross-modal Transformer, enabling accurate pose estimation for weakly textured targets. Specifically, the framework consists of two parallel Transformer branches, named Point Transformer and Image Transformer. The two branches use a pyramid-structured encoder and a multi-layer perceptron decoder to extract geometric features of point clouds and texture features of RGB images, respectively. A cross-modal key query strategy is then proposed for information exchange between the parallel branches. In addition, at the output representation stage, we design a simple and effective 3D keypoint selection algorithm to address the problem that keypoints are likely to appear in non-salient regions. Finally, to improve the accuracy of pose estimation and meet real-time requirements, a lightweight iterative pose network based on target feature regression is proposed to correct the error of the initial pose estimate. Extensive experiments on the LineMOD, Occlusion LineMOD, T-Less, and YCB-Video datasets demonstrate the effectiveness and superiority of our method, and comparisons with state-of-the-art methods show that it improves 6D pose estimation performance. Ablation studies and visualizations validate the design of CMT-6D.
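As a rough illustration of the idea of exchanging information between a point-cloud branch and an image branch, the sketch below applies generic scaled dot-product cross-attention in both directions. It is not the CMT-6D cross-modal key query strategy, whose details the abstract does not give. The function names, the toy feature vectors, and the omission of learned query/key/value projections are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product attention.

    Queries come from one modality; keys and values come from the other.
    Learned projection matrices are omitted (assumption, for brevity).
    """
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key of the other modality.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Weighted sum of the other modality's value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


# Hypothetical toy features: 3 point tokens and 2 image tokens, 2 dims each.
point_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_feats = [[1.0, 0.0], [0.0, 1.0]]

# Point tokens query the image features, and vice versa.
point_from_image = cross_attention(point_feats, image_feats, image_feats)
image_from_point = cross_attention(image_feats, point_feats, point_feats)
```

In this toy setup each output token is a weighted average of the other modality's features, so a point token that is equally similar to both image tokens receives their mean.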
Pages: 2011 - 2027
Page count: 17