Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection

Cited by: 0
Authors
Huang, Linyan [1 ]
Li, Zhiqi [2 ]
Sima, Chonghao [1 ]
Wang, Wenhai [3 ]
Wang, Jingdong [4 ]
Qiao, Yu [1 ]
Li, Hongyang [1 ]
Affiliations
[1] Shanghai AI Lab, Shanghai, Peoples R China
[2] Nanjing Univ, Nanjing, Peoples R China
[3] CUHK, Hong Kong, Peoples R China
[4] Baidu, Beijing, Peoples R China
Funding
National Key R&D Program of China
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Current research is primarily dedicated to advancing the accuracy of camera-only 3D object detectors (apprentice) through knowledge transferred from LiDAR- or multi-modal-based counterparts (expert). However, the domain gap between LiDAR and camera features, coupled with the inherent incompatibility in temporal fusion, significantly hinders the effectiveness of distillation-based enhancements for apprentices. Motivated by the success of uni-modal distillation, an apprentice-friendly expert model would predominantly rely on camera features while still achieving performance comparable to multi-modal models. To this end, we introduce VCD, a framework that improves the camera-only apprentice model through an apprentice-friendly multi-modal expert and temporal-fusion-friendly distillation supervision. The multi-modal expert, VCD-E, adopts the same structure as the camera-only apprentice to alleviate the feature disparity, and leverages LiDAR input as a depth prior to reconstruct the 3D scene, achieving performance on par with other heterogeneous multi-modal experts. In addition, a fine-grained trajectory-based distillation module is introduced to individually rectify the motion misalignment of each object in the scene. With these improvements, our camera-only apprentice, VCD-A, sets a new state of the art on nuScenes with a score of 63.1% NDS. The code will be released at https://github.com/OpenDriveLab/Birds-eye-view-Perception.
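To make the expert-to-apprentice transfer concrete, below is a minimal, hypothetical sketch of feature-level distillation over bird's-eye-view (BEV) features shared by a frozen expert and a camera-only apprentice. The function name, the tensor names (apprentice_bev, expert_bev, fg_mask), and the masked-L2 objective are illustrative assumptions for exposition only, not the authors' implementation or the trajectory-based module described above; see the linked repository for the actual code.

```python
# Hypothetical sketch: distilling a frozen multi-modal expert's BEV features
# into a camera-only apprentice that shares the same feature shape.
from typing import Optional

import torch


def bev_distillation_loss(apprentice_bev: torch.Tensor,
                          expert_bev: torch.Tensor,
                          fg_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Masked L2 distillation between apprentice and expert BEV features.

    apprentice_bev, expert_bev: (B, C, H, W) BEV feature maps.
    fg_mask: optional (B, 1, H, W) mask emphasizing foreground (object) cells.
    """
    diff = (apprentice_bev - expert_bev.detach()) ** 2  # expert provides fixed targets
    if fg_mask is not None:
        diff = diff * fg_mask  # weight object regions more heavily
        return diff.sum() / fg_mask.sum().clamp(min=1.0)
    return diff.mean()


# Toy usage: both models produce BEV features of identical shape, which is the
# point of giving the expert the same structure as the camera-only apprentice.
B, C, H, W = 2, 64, 128, 128
apprentice_feat = torch.randn(B, C, H, W, requires_grad=True)
expert_feat = torch.randn(B, C, H, W)
mask = (torch.rand(B, 1, H, W) > 0.7).float()
loss = bev_distillation_loss(apprentice_feat, expert_feat, mask)
loss.backward()
```

Because the expert is built on the apprentice's own architecture, no cross-modal feature adapter is needed in this sketch; the distillation term can simply be added to the apprentice's detection loss during training.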
Pages: 16
Related papers (50 in total)
  • [21] Artifacts Mapping: Multi-Modal Semantic Mapping for Object Detection and 3D Localization. Rollo, Federico; Raiola, Gennaro; Zunino, Andrea; Tsagarakis, Nikolaos; Ajoudani, Arash. 2023 EUROPEAN CONFERENCE ON MOBILE ROBOTS (ECMR), 2023: 90-97
  • [22] RoboFusion: Towards Robust Multi-Modal 3D Object Detection via SAM. Song, Ziying; Zhang, Guoxing; Liu, Lin; Yang, Lei; Xu, Shaoqing; Jia, Caiyan; Jia, Feiyang; Wang, Li. PROCEEDINGS OF THE THIRTY-THIRD INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE (IJCAI 2024), 2024: 1272-1280
  • [23] Multi-Modal Fusion Based on Depth Adaptive Mechanism for 3D Object Detection. Liu, Zhanwen; Cheng, Juanru; Fan, Jin; Lin, Shan; Wang, Yang; Zhao, Xiangmo. IEEE TRANSACTIONS ON MULTIMEDIA, 2025, 27: 707-717
  • [24] Height-Adaptive Deformable Multi-Modal Fusion for 3D Object Detection. Li, Jiahao; Chen, Lingshan; Li, Zhen. IEEE ACCESS, 2025, 13: 52385-52396
  • [25] Frustum FusionNet: Amodal 3D Object Detection with Multi-Modal Feature Fusion. Zuo, Liangyu; Li, Yaochen; Han, Mengtao; Li, Qiao; Liu, Yuehu. 2021 IEEE INTELLIGENT TRANSPORTATION SYSTEMS CONFERENCE (ITSC), 2021: 2746-2751
  • [26] Enhancing 3D object detection through multi-modal fusion for cooperative perception. Xia, Bin; Zhou, Jun; Kong, Fanyu; You, Yuhe; Yang, Jiarui; Lin, Lin. ALEXANDRIA ENGINEERING JOURNAL, 2024, 104: 46-55
  • [27] Regulating Intermediate 3D Features for Vision-Centric Autonomous Driving. Xu, Junkai; Peng, Liang; Cheng, Haoran; Xia, Linxuan; Zhou, Qi; Deng, Dan; Qian, Wei; Wang, Wenxiao; Cai, Deng. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 6, 2024: 6306-6314
  • [28] Leveraging Uncertainties for Deep Multi-modal Object Detection in Autonomous Driving. Feng, Di; Cao, Yifan; Rosenbaum, Lars; Timm, Fabian; Dietmayer, Klaus. 2020 IEEE INTELLIGENT VEHICLES SYMPOSIUM (IV), 2020: 871-878
  • [29] MMDistill: Multi-Modal BEV Distillation Framework for Multi-View 3D Object Detection. Jiao, Tianzhe; Chen, Yuming; Zhang, Zhe; Guo, Chaopeng; Song, Jie. CMC-COMPUTERS MATERIALS & CONTINUA, 2024, 81 (03): 4307-4325
  • [30] SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection. Xie, Yichen; Xu, Chenfeng; Rakotosaona, Marie-Julie; Rim, Patrick; Tombari, Federico; Keutzer, Kurt; Tomizuka, Masayoshi; Zhan, Wei. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023: 17545-17556