Structure-Aware Cross-Modal Transformer for Depth Completion

Cited by: 2
Authors
Zhao, Linqing [1 ]
Wei, Yi [2 ,3 ]
Li, Jiaxin [4 ]
Zhou, Jie [2 ,3 ]
Lu, Jiwen [2 ,3 ]
Affiliations
[1] Tianjin University, School of Electrical and Information Engineering, Tianjin 300072, China
[2] Tsinghua University, Department of Automation, Beijing 100084, China
[3] Beijing National Research Center for Information Science and Technology (BNRist), Beijing 100084, China
[4] Gaussian Robot, Shanghai 201203, China
Funding
National Natural Science Foundation of China
Keywords
Depth completion; cross-modal interaction; structure learning; transformer; NETWORK; FUSION;
DOI
10.1109/TIP.2024.3355807
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
In this paper, we present a Structure-aware Cross-Modal Transformer (SCMT) to fully capture the 3D structures hidden in sparse depths for depth completion. Most existing methods learn to predict dense depths by taking depths as an additional channel of RGB images or learning 2D affinities to perform depth propagation. However, they fail to exploit 3D structures implied in the depth channel, thereby losing the informative 3D knowledge that provides important priors to distinguish the foreground and background features. Moreover, since these methods rely on the color textures of 2D images, it is challenging for them to handle poor-texture regions without the guidance of explicit 3D cues. To address this, we disentangle the hierarchical 3D scene-level structure from the RGB-D input and construct a pathway to make sharp depth boundaries and object shape outlines accessible to 2D features. Specifically, we extract 2D and 3D features from depth inputs and the back-projected point clouds respectively by building a two-stream network. To leverage 3D structures, we construct several cross-modal transformers to adaptively propagate multi-scale 3D structural features to the 2D stream, energizing 2D features with priors of object shapes and local geometries. Experimental results show that our SCMT achieves state-of-the-art performance on three popular outdoor (KITTI) and indoor (VOID and NYU) benchmarks.
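The abstract's two-stream pipeline — back-projecting sparse depths into a point cloud and letting 2D feature tokens gather 3D structural cues through cross-modal attention — can be sketched roughly as below. This is a minimal single-head NumPy illustration, not the authors' implementation; all function names, shapes, and the residual-fusion detail are assumptions for exposition.

```python
import numpy as np

def backproject(depth, K):
    """Back-project a sparse depth map (H, W) into a 3D point cloud
    using camera intrinsics K; zero entries mark missing depth."""
    v, u = np.nonzero(depth > 0)
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)  # (N, 3)

def cross_modal_attention(feat2d, feat3d, Wq, Wk, Wv):
    """Single-head cross-attention: 2D tokens (queries) attend to
    3D point features (keys/values), then fuse residually so the
    2D stream keeps its appearance features while gaining 3D cues."""
    Q = feat2d @ Wq                      # (M, d) queries from 2D stream
    Kp = feat3d @ Wk                     # (N, d) keys from 3D stream
    V = feat3d @ Wv                      # (N, d) values from 3D stream
    scores = Q @ Kp.T / np.sqrt(Q.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return feat2d + attn @ V             # residual fusion into 2D features
```

In the paper this interaction is applied at multiple scales of the two-stream network; the sketch shows a single scale with one attention head for clarity.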
Pages: 1016-1031
Page count: 16
Related Papers
(50 in total)
  • [21] Yao, Quanming; Kwok, James T.; Han, Bo. Efficient Nonconvex Regularized Tensor Completion with Structure-aware Proximal Iterations. International Conference on Machine Learning (ICML), Vol. 97, 2019.
  • [22] Shukor, Mustafa; Couairon, Guillaume; Grechka, Asya; Cord, Matthieu. Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022, pp. 4566-4577.
  • [23] Zhang, Wenjing; Tan, Quange; Li, Pengxin; Zhang, Qi; Wang, Rong. Cross-modal transformer with language query for referring image segmentation. Neurocomputing, 2023, 536: 191-205.
  • [24] Li, Mingjie; Cai, Wenjia; Verspoor, Karin; Pan, Shirui; Liang, Xiaodan; Chang, Xiaojun. Cross-modal Clinical Graph Transformer for Ophthalmic Report Generation. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 20624-20633.
  • [25] Shin, Andrew; Narihira, Takuya. Transformer-Exclusive Cross-Modal Representation for Vision and Language. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 2719-2725.
  • [26] Chen, Xiaotian; Chen, Xuejin; Zha, Zheng-Jun. Structure-Aware Residual Pyramid Network for Monocular Depth Estimation. Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI), 2019, pp. 694-700.
  • [27] Wu, Ao; Wang, Rong; Tan, Quange; Song, Zhenfeng. Decoupled Cross-Modal Transformer for Referring Video Object Segmentation. Sensors, 2024, 24(16).
  • [28] Ristea, Nicolae-Catalin; Anghel, Andrei; Ionescu, Radu Tudor. Cascaded cross-modal transformer for audio-textual classification. Artificial Intelligence Review, 2024, 57(9).
  • [29] Chen, Mingkai; Zhao, Lindong; Chen, Jianxin; Wei, Xin; Guizani, Mohsen. Modal-Aware Resource Allocation for Cross-Modal Collaborative Communication in IIoT. IEEE Internet of Things Journal, 2023, 10(17): 14952-14964.
  • [30] Ying, Xiaowen; Chuah, Mooi Choo. UCTNet: Uncertainty-Aware Cross-Modal Transformer Network for Indoor RGB-D Semantic Segmentation. Computer Vision - ECCV 2022, Part XXX, LNCS 13690, pp. 20-37.