Enhancing 6-DoF Object Pose Estimation through Multiple Modality Fusion: A Hybrid CNN Architecture with Cross-Layer and Cross-Modal Integration

Cited by: 1
Authors
Wang, Zihang [1 ]
Sun, Xueying [1 ,2 ]
Wei, Hao [3 ]
Ma, Qing [1 ]
Zhang, Qiang [1 ,2 ]
Affiliations
[1] Jiangsu Univ Sci & Technol, Coll Automat, 666 Changhui Rd, Zhenjiang 212100, Peoples R China
[2] Jiangsu Univ Sci & Technol, Syst Sci Lab, 666 Changhui Rd, Zhenjiang 212100, Peoples R China
[3] Jiangsu Univ Sci & Technol, Shenlan Coll, 666 Changhui Rd, Zhenjiang 212100, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
cross layer; cross modality; hybrid CNN architecture; object pose estimation; IMAGE; HISTOGRAMS;
DOI
10.3390/machines11090891
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Subject Classification Codes
0808; 0809;
Abstract
Recently, the use of RGB-D data for robot perception tasks has garnered significant attention in domains such as robotics and autonomous driving. However, a prominent challenge in this field is the substantial impact of feature robustness on both segmentation and pose estimation. To tackle this challenge, we proposed a two-stage hybrid Convolutional Neural Network (CNN) architecture that connects segmentation and pose estimation in tandem. Specifically, we developed Cross-Modal (CM) and Cross-Layer (CL) modules to exploit the complementary information in the RGB and depth modalities, as well as the hierarchical features from different layers of the network. The CM and CL integration strategy significantly enhanced segmentation accuracy by effectively capturing spatial and contextual information. Furthermore, we introduced the Convolutional Block Attention Module (CBAM), which dynamically recalibrates feature maps so that the network focuses on informative regions and channels, thereby improving the pose estimation stage. We conducted extensive experiments on benchmark datasets and achieved strong pose estimation results, with an average accuracy of 94.5% under the ADD-S AUC metric and 97.6% of predictions with an ADD-S error below 2 cm. These results demonstrate the superior performance of the proposed method.
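For reference, the ADD-S figures quoted above follow the standard convention from the pose estimation literature (the PoseCNN/YCB-Video protocol): ADD-S is the average distance from each model point under the predicted pose to the closest model point under the ground-truth pose, and the AUC is the normalised area under the accuracy-versus-threshold curve up to 10 cm. The sketch below is a minimal NumPy/SciPy illustration of that convention, not code released with this paper; all function names are illustrative.

```python
# Minimal sketch of the ADD-S metric and its AUC summary (PoseCNN/YCB-Video
# convention). Illustrative only; not code from the paper under discussion.
import numpy as np
from scipy.spatial import cKDTree

def add_s(model_points, R_pred, t_pred, R_gt, t_gt):
    """Average distance from each predicted-pose model point to its closest
    ground-truth-pose model point (the symmetry-aware variant of ADD)."""
    pred = model_points @ R_pred.T + t_pred    # (N, 3) points under predicted pose
    gt = model_points @ R_gt.T + t_gt          # (N, 3) points under ground-truth pose
    nn_dist, _ = cKDTree(gt).query(pred, k=1)  # nearest-neighbour distance per point
    return nn_dist.mean()

def add_s_auc(errors, max_threshold=0.10):
    """Normalised area under the accuracy-vs-threshold curve, with thresholds
    swept over [0, max_threshold] metres (0.10 m in the usual convention)."""
    thresholds = np.linspace(0.0, max_threshold, 1000)
    accuracy = np.array([(errors < t).mean() for t in thresholds])
    return np.trapz(accuracy, thresholds) / max_threshold

# Toy usage: a 1 mm translation error on a small three-point "model".
pts = np.array([[0.05, 0.0, 0.0], [0.0, 0.05, 0.0], [0.0, 0.0, 0.05]])
R = np.eye(3)
err = add_s(pts, R, np.array([0.001, 0.0, 0.0]), R, np.zeros(3))
print(err < 0.02)  # True: this sample would count towards the "<2 cm" rate
```

Under this convention, the abstract's 97.6% figure corresponds to the fraction of test samples whose add_s error falls below 0.02 m, and the 94.5% figure to add_s_auc computed over the per-sample errors.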
Pages: 22
Related Papers
6 items
  • [1] Cross-modal attention and geometric contextual aggregation network for 6DoF object pose estimation
    Guo, Yi
    Wang, Fei
    Chu, Hao
    Wen, Shiguang
    NEUROCOMPUTING, 2025, 617
  • [2] CMA: Cross-modal attention for 6D object pose estimation
    Zou, Lu
    Huang, Zhangjin
    Wang, Fangjun
    Yang, Zhouwang
    Wang, Guoping
COMPUTERS & GRAPHICS-UK, 2021, 97: 139 - 147
  • [3] Cross-modal interaction fusion grasping detection based on Transformer-CNN hybrid architecture
    Wang, Yong
    Li, Yi-Ling
    Miao, Duo-Qian
    An, Chun-Yan
    Yuan, Xin-Lin
    Kongzhi yu Juece/Control and Decision, 2024, 39 (11): 3607 - 3616
  • [4] 6D Object Pose Estimation Based on Cross-Modality Feature Fusion
    Jiang, Meng
    Zhang, Liming
    Wang, Xiaohua
    Li, Shuang
    Jiao, Yijie
    SENSORS, 2023, 23 (19)
  • [5] Enhancing target detection accuracy through cross-modal spatial perception and dual-modality fusion
    Zhang, Ning
    Zhu, Wenqing
    FRONTIERS IN PHYSICS, 2024, 12
  • [6] CMT-6D: a lightweight iterative 6DoF pose estimation network based on cross-modal Transformer
    Liu, Suyi
    Xu, Fang
    Wu, Chengdong
    Chi, Jianning
    Yu, Xiaosheng
    Wei, Longxing
    Leng, Chuanjiang
    VISUAL COMPUTER, 2025, 41 (03): 2011 - 2027