TransCNNLoc: End-to-end pixel-level learning for 2D-to-3D pose estimation in dynamic indoor scenes

Cited by: 2
Authors
Tang, Shengjun [1 ,2 ]
Li, Yusong [3 ]
Wan, Jiawei [1 ]
Li, You [4 ]
Zhou, Baoding [5 ]
Guo, Renzhong [1 ,2 ]
Wang, Weixi [1 ,2 ]
Feng, Yuhong [3 ]
Affiliations
[1] Shenzhen Univ, Res Inst Smart Cities, Sch Architecture & Urban Planning, Shenzhen, Peoples R China
[2] State Key Lab Subtrop Bldg & Urban Sci, Guangzhou, Peoples R China
[3] Shenzhen Univ, Coll Comp Sci & Software Engn, Shenzhen, Peoples R China
[4] Guangdong Lab Artificial Intelligence & Digital Ec, Shenzhen, Peoples R China
[5] Shenzhen Univ, Coll Civil & Transportat Engn, Shenzhen, Peoples R China
Keywords
Indoor localization; Feature learning; Structure from motion; Levenberg-Marquardt; Image retrieval;
DOI
10.1016/j.isprsjprs.2023.12.006
Chinese Library Classification
P9 [Physical Geography];
Discipline Codes
0705; 070501;
Abstract
Accurate localization in GPS-denied environments has long been a core problem in computer vision and robotics. In indoor environments, vision-based localization methods are susceptible to changes in lighting, viewpoint, and environmental conditions, leading to localization failures or limited generalization. In this paper, we propose the TransCNNLoc framework, an encoder-decoder network designed to learn more robust image features for camera pose estimation. In the feature encoding stage, a CNN and a Swin Transformer are integrated into the image feature encoding module, enabling the network to extract both global context and local features from images. In the decoding stage, multi-level image features are decoded through cross-layer connections while per-pixel feature weight maps are computed. To improve robustness to dynamic objects, a dynamic object recognition network is introduced to optimize the feature weights. Finally, a coarse-to-fine multi-level iterative optimization recovers the six-degree-of-freedom (6-DoF) camera pose. Experiments were conducted on the public 7-Scenes dataset as well as on a dataset collected under changing lighting and dynamic scenes for accuracy validation and analysis. The results demonstrate that TransCNNLoc adapts better to dynamic scenes and lighting changes than existing methods. In the static environments of the public dataset, the proposed method achieves an accuracy of up to 5 centimeters, delivering the best results in the majority of scenes. Under dynamic scenes and fluctuating illumination, it reaches an accuracy of up to 3 centimeters.
This improves localization precision from the decimeter scale to the centimeter scale, a significant advance over existing state-of-the-art (SOTA) algorithms. The open-source repository for the proposed method is available at github.com/Geelooo/TransCNNloc.
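The coarse-to-fine pose optimization described in the abstract combines learned per-pixel feature weights with Levenberg-Marquardt iteration (one of the paper's keywords). As an illustration only, the sketch below shows a generic weighted Levenberg-Marquardt refinement of a 6-DoF pose over 2D reprojection residuals in NumPy. The function names, the numerical Jacobian, and the axis-angle parameterization are assumptions for this sketch, not the paper's implementation, which iterates over multi-level feature maps rather than explicit 2D-3D correspondences; the role of the weights (down-weighting unreliable or dynamic pixels) is the point being illustrated.

```python
import numpy as np

def rodrigues(rvec):
    """Axis-angle vector to rotation matrix (Rodrigues' formula)."""
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rvec / theta
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def project(pose, pts3d, K):
    """Project 3D points with pose = [rvec | t] and intrinsics K."""
    R = rodrigues(pose[:3])
    cam = pts3d @ R.T + pose[3:]
    uv = cam @ K.T
    return uv[:, :2] / uv[:, 2:3]

def lm_refine(pose, pts3d, obs2d, K, weights, iters=20, lam=1e-3):
    """Weighted Levenberg-Marquardt refinement of a 6-DoF pose.

    weights: per-point confidences (e.g. a learned feature-weight map
    sampled at each pixel); down-weighted points, such as those on
    dynamic objects, contribute less to each update.
    """
    w = np.repeat(np.sqrt(weights), 2)      # one weight per residual row
    for _ in range(iters):
        r = (project(pose, pts3d, K) - obs2d).ravel() * w
        # numerical Jacobian of the weighted residuals w.r.t. the pose
        J = np.zeros((r.size, 6))
        eps = 1e-6
        for j in range(6):
            dp = np.zeros(6); dp[j] = eps
            rj = (project(pose + dp, pts3d, K) - obs2d).ravel() * w
            J[:, j] = (rj - r) / eps
        H = J.T @ J + lam * np.eye(6)       # damped normal equations
        step = np.linalg.solve(H, -J.T @ r)
        cand = pose + step
        r_new = (project(cand, pts3d, K) - obs2d).ravel() * w
        if np.sum(r_new ** 2) < np.sum(r ** 2):
            pose, lam = cand, lam * 0.5     # accept step, relax damping
        else:
            lam *= 10.0                     # reject step, increase damping
    return pose
```

In the multi-level scheme the abstract describes, a loop like this would run first on coarse (downsampled) features to absorb large pose errors, then on finer levels for centimeter-level refinement.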
Pages: 218-230 (13 pages)
Related Papers
50 records in total
  • [1] End-to-end 3D Human Pose Estimation with Transformer
    Zhang, Bowei
    Cui, Peng
    2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, : 4529 - 4536
  • [2] An end-to-end framework for unconstrained monocular 3D hand pose estimation
    Sharma, Sanjeev
    Huang, Shaoli
    PATTERN RECOGNITION, 2021, 115
  • [3] DeepHPS: End-to-end Estimation of 3D Hand Pose and Shape by Learning from Synthetic Depth
    Malik, Jameel
    Elhayek, Ahmed
    Nunnari, Fabrizio
    Varanasi, Kiran
    Tamaddon, Kiarash
    Heloir, Alexis
    Stricker, Didier
    2018 INTERNATIONAL CONFERENCE ON 3D VISION (3DV), 2018, : 110 - 119
  • [4] LiteDEKR: End-to-end lite 2D human pose estimation network
    Lv, Xueqiang
    Hao, Wei
    Tian, Lianghai
    Han, Jing
    Chen, Yuzhong
    Cai, Zangtai
    IET IMAGE PROCESSING, 2023, 17 (12) : 3392 - 3400
  • [5] Linear-Covariance Loss for End-to-End Learning of 6D Pose Estimation
    Liu, Fulin
    Hu, Yinlin
    Salzmann, Mathieu
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 14061 - 14071
  • [6] End2End Semantic Segmentation for 3D Indoor Scenes
    Zhao, Na
    PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18), 2018, : 810 - 814
  • [7] End-to-end learning of keypoint detector and descriptor for pose invariant 3D matching
    Georgakis, Georgios
    Karanam, Srikrishna
    Wu, Ziyan
    Ernst, Jan
    Kosecka, Jana
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 1965 - 1973
  • [8] IVT: An End-to-End Instance-guided Video Transformer for 3D Pose Estimation
    Qiu, Zhongwei
    Yang, Qiansheng
    Wang, Jian
    Fu, Dongmei
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 6174 - 6182
  • [9] An AI-empowered Cloud Solution towards End-to-End 2D-to-3D Image Conversion for Autostereoscopic 3D Display
    Lim, Jun Wei
    Yeo, Jin Qi
    Xia, Xinxing
    Guan, Frank
    28TH ACM SYMPOSIUM ON VIRTUAL REALITY SOFTWARE AND TECHNOLOGY, VRST 2022, 2022,
  • [10] FusionNet: An End-to-End Hybrid Model for 6D Object Pose Estimation
    Ye, Yuning
    Park, Hanhoon
    ELECTRONICS, 2023, 12 (19)