Structure-Aware Cross-Modal Transformer for Depth Completion

Cited by: 2
Authors
Zhao, Linqing [1]
Wei, Yi [2, 3]
Li, Jiaxin [4]
Zhou, Jie [2, 3]
Lu, Jiwen [2, 3]
Affiliations
[1] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China
[2] Tsinghua Univ, Dept Automat, Beijing 100084, Peoples R China
[3] Beijing Natl Res Ctr Informat Sci & Technol BNRist, Beijing 100084, Peoples R China
[4] Gaussian Robot, Shanghai 201203, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Depth completion; cross-modal interaction; structure learning; transformer; NETWORK; FUSION;
DOI
10.1109/TIP.2024.3355807
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we present a Structure-aware Cross-Modal Transformer (SCMT) to fully capture the 3D structures hidden in sparse depths for depth completion. Most existing methods learn to predict dense depths by taking depths as an additional channel of RGB images or learning 2D affinities to perform depth propagation. However, they fail to exploit 3D structures implied in the depth channel, thereby losing the informative 3D knowledge that provides important priors to distinguish the foreground and background features. Moreover, since these methods rely on the color textures of 2D images, it is challenging for them to handle poor-texture regions without the guidance of explicit 3D cues. To address this, we disentangle the hierarchical 3D scene-level structure from the RGB-D input and construct a pathway to make sharp depth boundaries and object shape outlines accessible to 2D features. Specifically, we extract 2D and 3D features from depth inputs and the back-projected point clouds respectively by building a two-stream network. To leverage 3D structures, we construct several cross-modal transformers to adaptively propagate multi-scale 3D structural features to the 2D stream, energizing 2D features with priors of object shapes and local geometries. Experimental results show that our SCMT achieves state-of-the-art performance on three popular outdoor (KITTI) and indoor (VOID and NYU) benchmarks.
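The abstract mentions extracting 3D features from "back-projected point clouds". As a rough illustration of that one step only, the sketch below lifts the valid pixels of a sparse depth map into camera-frame 3D points with the standard pinhole model. The function name `backproject_depth` and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions made for this example; the record does not describe the paper's actual implementation, and the toy values are not taken from any benchmark.

```python
import numpy as np


def backproject_depth(depth, fx, fy, cx, cy):
    """Lift valid pixels of a sparse depth map (H, W) to camera-frame points (N, 3).

    Hypothetical helper: assumes a pinhole camera with focal lengths (fx, fy)
    and principal point (cx, cy), and that a depth of 0 means "no measurement".
    """
    # Row (v) and column (u) indices of pixels that carry a depth measurement.
    v, u = np.nonzero(depth > 0)
    z = depth[v, u].astype(np.float64)
    # Standard pinhole back-projection: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)


# Toy 4x4 sparse depth map with two measurements (illustrative values only).
depth = np.zeros((4, 4), dtype=np.float32)
depth[1, 2] = 2.0
depth[3, 0] = 4.0
points = backproject_depth(depth, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
print(points)
```

Only pixels with a measurement become points, so the point cloud is as sparse as the input depth; a 3D stream would then operate on `points` while a 2D stream operates on the depth image itself.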
Pages: 1016-1031
Number of pages: 16