Structure-Aware Cross-Modal Transformer for Depth Completion

Cited by: 2
Authors
Zhao, Linqing [1 ]
Wei, Yi [2 ,3 ]
Li, Jiaxin [4 ]
Zhou, Jie [2 ,3 ]
Lu, Jiwen [2 ,3 ]
Affiliations
[1] Tianjin University, School of Electrical and Information Engineering, Tianjin 300072, China
[2] Tsinghua University, Department of Automation, Beijing 100084, China
[3] Beijing National Research Center for Information Science and Technology (BNRist), Beijing 100084, China
[4] Gaussian Robot, Shanghai 201203, China
Funding
National Natural Science Foundation of China
Keywords
Depth completion; cross-modal interaction; structure learning; transformer; NETWORK; FUSION;
DOI
10.1109/TIP.2024.3355807
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
In this paper, we present a Structure-aware Cross-Modal Transformer (SCMT) to fully capture the 3D structures hidden in sparse depths for depth completion. Most existing methods learn to predict dense depths by taking depths as an additional channel of RGB images or learning 2D affinities to perform depth propagation. However, they fail to exploit 3D structures implied in the depth channel, thereby losing the informative 3D knowledge that provides important priors to distinguish the foreground and background features. Moreover, since these methods rely on the color textures of 2D images, it is challenging for them to handle poor-texture regions without the guidance of explicit 3D cues. To address this, we disentangle the hierarchical 3D scene-level structure from the RGB-D input and construct a pathway to make sharp depth boundaries and object shape outlines accessible to 2D features. Specifically, we extract 2D and 3D features from depth inputs and the back-projected point clouds respectively by building a two-stream network. To leverage 3D structures, we construct several cross-modal transformers to adaptively propagate multi-scale 3D structural features to the 2D stream, energizing 2D features with priors of object shapes and local geometries. Experimental results show that our SCMT achieves state-of-the-art performance on three popular outdoor (KITTI) and indoor (VOID and NYU) benchmarks.
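The abstract's two-stream pipeline — back-projecting sparse depths into a point cloud and letting 2D feature tokens gather 3D structural cues through cross-modal attention — can be sketched roughly as below. This is a minimal single-head NumPy illustration, not the authors' implementation; all function names, shapes, and the residual-fusion detail are assumptions for exposition.

```python
import numpy as np

def backproject(depth, K):
    """Back-project a sparse depth map (H, W) into a 3D point cloud
    using camera intrinsics K; zero entries mark missing depth."""
    v, u = np.nonzero(depth > 0)
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)  # (N, 3)

def cross_modal_attention(feat2d, feat3d, Wq, Wk, Wv):
    """Single-head cross-attention: 2D tokens (queries) attend to
    3D point features (keys/values), then fuse residually so the
    2D stream keeps its appearance features while gaining 3D cues."""
    Q = feat2d @ Wq                      # (M, d) queries from 2D stream
    Kp = feat3d @ Wk                     # (N, d) keys from 3D stream
    V = feat3d @ Wv                      # (N, d) values from 3D stream
    scores = Q @ Kp.T / np.sqrt(Q.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return feat2d + attn @ V             # residual fusion into 2D features
```

In the paper this interaction is applied at multiple scales of the two-stream network; the sketch shows a single scale with one attention head for clarity.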
Pages: 1016-1031
Page count: 16
Related Papers
(50 in total)
  • [21] Yao, Quanming; Kwok, James T.; Han, Bo. Efficient Nonconvex Regularized Tensor Completion with Structure-aware Proximal Iterations. International Conference on Machine Learning (ICML), Vol. 97, 2019.
  • [22] Shukor, Mustafa; Couairon, Guillaume; Grechka, Asya; Cord, Matthieu. Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022, pp. 4566-4577.
  • [23] Zhang, Wenjing; Tan, Quange; Li, Pengxin; Zhang, Qi; Wang, Rong. Cross-modal transformer with language query for referring image segmentation. Neurocomputing, 2023, 536: 191-205.
  • [24] Li, Mingjie; Cai, Wenjia; Verspoor, Karin; Pan, Shirui; Liang, Xiaodan; Chang, Xiaojun. Cross-modal Clinical Graph Transformer for Ophthalmic Report Generation. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 20624-20633.
  • [25] Shin, Andrew; Narihira, Takuya. Transformer-Exclusive Cross-Modal Representation for Vision and Language. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 2719-2725.
  • [26] Chen, Xiaotian; Chen, Xuejin; Zha, Zheng-Jun. Structure-Aware Residual Pyramid Network for Monocular Depth Estimation. Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI), 2019, pp. 694-700.
  • [27] Wu, Ao; Wang, Rong; Tan, Quange; Song, Zhenfeng. Decoupled Cross-Modal Transformer for Referring Video Object Segmentation. Sensors, 2024, 24(16).
  • [28] Ristea, Nicolae-Catalin; Anghel, Andrei; Ionescu, Radu Tudor. Cascaded cross-modal transformer for audio-textual classification. Artificial Intelligence Review, 2024, 57(9).
  • [29] Chen, Mingkai; Zhao, Lindong; Chen, Jianxin; Wei, Xin; Guizani, Mohsen. Modal-Aware Resource Allocation for Cross-Modal Collaborative Communication in IIoT. IEEE Internet of Things Journal, 2023, 10(17): 14952-14964.
  • [30] Ying, Xiaowen; Chuah, Mooi Choo. UCTNet: Uncertainty-Aware Cross-Modal Transformer Network for Indoor RGB-D Semantic Segmentation. Computer Vision - ECCV 2022, Part XXX, LNCS 13690, pp. 20-37.