Learning to Fuse Monocular and Multi-view Cues for Multi-frame Depth Estimation in Dynamic Scenes

Cited by: 5
Authors
Li, Rui [1 ]
Gong, Dong [2 ]
Yin, Wei [3 ]
Chen, Hao [4 ]
Zhu, Yu [1 ]
Wang, Kaixuan [3 ]
Chen, Xiaozhi [3 ]
Sun, Jinqiu [1 ]
Zhang, Yanning [1 ]
Affiliations
[1] Northwestern Polytech Univ, Xian, Peoples R China
[2] Univ New South Wales, Sydney, NSW, Australia
[3] DJI, Shenzhen, Peoples R China
[4] Zhejiang Univ, Hangzhou, Peoples R China
Funding
Australian Research Council;
Keywords
DOI
10.1109/CVPR52729.2023.02063
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Multi-frame depth estimation generally achieves high accuracy by relying on multi-view geometric consistency. When applied in dynamic scenes, e.g., autonomous driving, this consistency is usually violated in dynamic areas, leading to corrupted estimations. Many multi-frame methods handle dynamic areas by identifying them with explicit masks and compensating the multi-view cues with monocular cues, represented as local monocular depth or features. The improvements are limited due to the uncontrolled quality of the masks and the underutilized benefits of fusing the two types of cues. In this paper, we propose a novel method that learns to fuse the multi-view and monocular cues encoded as volumes, without needing heuristically crafted masks. As unveiled in our analyses, the multi-view cues capture more accurate geometric information in static areas, while the monocular cues capture more useful contexts in dynamic areas. To propagate the geometric perception learned from multi-view cues in static areas to the monocular representation in dynamic areas, and to let monocular cues enhance the representation of the multi-view cost volume, we propose a cross-cue fusion (CCF) module. It includes cross-cue attention (CCA), which encodes the spatially non-local relative intra-relations from each source to enhance the representation of the other. Experiments on real-world datasets demonstrate the significant effectiveness and generalization ability of the proposed method.
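The cross-cue attention described in the abstract — using one cue's spatial self-similarities as non-local attention weights to enhance the other cue's features — can be sketched as follows. This is a minimal NumPy illustration under assumed toy shapes, not the authors' implementation (which operates on deep volume features with learned projections inside a full network); the function names and shapes here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_cue_attention(guide, target):
    # guide, target: (C, N) feature volumes flattened over N spatial positions.
    # Attention weights come from the guide cue's own (intra-cue) spatial
    # similarities; they then aggregate the *other* cue's features non-locally.
    # A residual add keeps the original target signal.
    C, _ = guide.shape
    attn = softmax(guide.T @ guide / np.sqrt(C), axis=-1)  # (N, N), rows sum to 1
    return target + target @ attn.T

def cross_cue_fusion(mono, mv):
    # Enhance each cue with the other's relations, then concatenate channels.
    mono_enh = cross_cue_attention(mv, mono)  # multi-view relations guide monocular features
    mv_enh = cross_cue_attention(mono, mv)    # monocular relations guide multi-view features
    return np.concatenate([mono_enh, mv_enh], axis=0)

rng = np.random.default_rng(0)
mono = rng.standard_normal((8, 16))  # toy monocular feature volume (C=8, N=16)
mv = rng.standard_normal((8, 16))    # toy multi-view cost-volume features
fused = cross_cue_fusion(mono, mv)   # (16, 16): doubled channel dimension
```

The key design point from the abstract is that the attention map is computed from one cue but applied to the other, so geometric relations learned by the multi-view cue in static areas can steer the monocular representation in dynamic areas, and vice versa.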
Pages: 21539-21548
Page count: 10