Enhancing 6-DoF Object Pose Estimation through Multiple Modality Fusion: A Hybrid CNN Architecture with Cross-Layer and Cross-Modal Integration

Cited by: 1
Authors
Wang, Zihang [1 ]
Sun, Xueying [1 ,2 ]
Wei, Hao [3 ]
Ma, Qing [1 ]
Zhang, Qiang [1 ,2 ]
Affiliations
[1] Jiangsu Univ Sci & Technol, Coll Automat, 666 Changhui Rd, Zhenjiang 212100, Peoples R China
[2] Jiangsu Univ Sci & Technol, Syst Sci Lab, 666 Changhui Rd, Zhenjiang 212100, Peoples R China
[3] Jiangsu Univ Sci & Technol, Shenlan Coll, 666 Changhui Rd, Zhenjiang 212100, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
cross layer; cross modality; hybrid CNN architecture; object pose estimation; IMAGE; HISTOGRAMS;
DOI
10.3390/machines11090891
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Subject Classification Codes
0808; 0809;
Abstract
Recently, the use of RGB-D data for robot perception tasks has garnered significant attention in domains such as robotics and autonomous driving. However, a prominent challenge in this field is the substantial impact of feature robustness on both segmentation and pose estimation. To tackle this challenge, we proposed a two-stage hybrid Convolutional Neural Network (CNN) architecture that connects segmentation and pose estimation in tandem. Specifically, we developed Cross-Modal (CM) and Cross-Layer (CL) modules to exploit the complementary information in the RGB and depth modalities, as well as the hierarchical features from different layers of the network. The CM and CL integration strategy significantly enhanced segmentation accuracy by effectively capturing spatial and contextual information. Furthermore, we introduced the Convolutional Block Attention Module (CBAM), which dynamically recalibrates feature maps so that the network focuses on informative regions and channels, thereby improving the pose estimation stage. We conducted extensive experiments on benchmark datasets and achieved strong pose estimation results, with an average accuracy of 94.5% under the ADD-S AUC metric and 97.6% of predictions with an ADD-S error below 2 cm. These results demonstrate the superior performance of the proposed method.
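For reference, the ADD-S figures quoted above follow the standard convention from the pose estimation literature (the PoseCNN/YCB-Video protocol): ADD-S is the average distance from each model point under the predicted pose to the closest model point under the ground-truth pose, and the AUC is the normalised area under the accuracy-versus-threshold curve up to 10 cm. The sketch below is a minimal NumPy/SciPy illustration of that convention, not code released with this paper; all function names are illustrative.

```python
# Minimal sketch of the ADD-S metric and its AUC summary (PoseCNN/YCB-Video
# convention). Illustrative only; not code from the paper under discussion.
import numpy as np
from scipy.spatial import cKDTree

def add_s(model_points, R_pred, t_pred, R_gt, t_gt):
    """Average distance from each predicted-pose model point to its closest
    ground-truth-pose model point (the symmetry-aware variant of ADD)."""
    pred = model_points @ R_pred.T + t_pred    # (N, 3) points under predicted pose
    gt = model_points @ R_gt.T + t_gt          # (N, 3) points under ground-truth pose
    nn_dist, _ = cKDTree(gt).query(pred, k=1)  # nearest-neighbour distance per point
    return nn_dist.mean()

def add_s_auc(errors, max_threshold=0.10):
    """Normalised area under the accuracy-vs-threshold curve, with thresholds
    swept over [0, max_threshold] metres (0.10 m in the usual convention)."""
    thresholds = np.linspace(0.0, max_threshold, 1000)
    accuracy = np.array([(errors < t).mean() for t in thresholds])
    return np.trapz(accuracy, thresholds) / max_threshold

# Toy usage: a 1 mm translation error on a small three-point "model".
pts = np.array([[0.05, 0.0, 0.0], [0.0, 0.05, 0.0], [0.0, 0.0, 0.05]])
R = np.eye(3)
err = add_s(pts, R, np.array([0.001, 0.0, 0.0]), R, np.zeros(3))
print(err < 0.02)  # True: this sample would count towards the "<2 cm" rate
```

Under this convention, the abstract's 97.6% figure corresponds to the fraction of test samples whose add_s error falls below 0.02 m, and the 94.5% figure to add_s_auc computed over the per-sample errors.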
Pages: 22
Related Papers
6 items
  • [1] Cross-modal attention and geometric contextual aggregation network for 6DoF object pose estimation
    Guo, Yi
    Wang, Fei
    Chu, Hao
    Wen, Shiguang
    NEUROCOMPUTING, 2025, 617
  • [2] CMA: Cross-modal attention for 6D object pose estimation
    Zou, Lu
    Huang, Zhangjin
    Wang, Fangjun
    Yang, Zhouwang
    Wang, Guoping
COMPUTERS & GRAPHICS-UK, 2021, 97: 139 - 147
  • [3] Cross-modal interaction fusion grasping detection based on Transformer-CNN hybrid architecture
    Wang, Yong
    Li, Yi-Ling
    Miao, Duo-Qian
    An, Chun-Yan
    Yuan, Xin-Lin
    Kongzhi yu Juece/Control and Decision, 2024, 39 (11): 3607 - 3616
  • [4] 6D Object Pose Estimation Based on Cross-Modality Feature Fusion
    Jiang, Meng
    Zhang, Liming
    Wang, Xiaohua
    Li, Shuang
    Jiao, Yijie
    SENSORS, 2023, 23 (19)
  • [5] Enhancing target detection accuracy through cross-modal spatial perception and dual-modality fusion
    Zhang, Ning
    Zhu, Wenqing
    FRONTIERS IN PHYSICS, 2024, 12
  • [6] CMT-6D: a lightweight iterative 6DoF pose estimation network based on cross-modal Transformer
    Liu, Suyi
    Xu, Fang
    Wu, Chengdong
    Chi, Jianning
    Yu, Xiaosheng
    Wei, Longxing
    Leng, Chuanjiang
    VISUAL COMPUTER, 2025, 41 (03): 2011 - 2027