Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection

Cited by: 0
Authors
Huang, Linyan [1]
Li, Zhiqi [2]
Sima, Chonghao [1]
Wang, Wenhai [3]
Wang, Jingdong [4]
Qiao, Yu [1]
Li, Hongyang [1]
Affiliations
[1] Shanghai AI Lab, Shanghai, Peoples R China
[2] Nanjing Univ, Nanjing, Peoples R China
[3] CUHK, Hong Kong, Peoples R China
[4] Baidu, Beijing, Peoples R China
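Funding
National Key Research and Development Program of China;
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract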
Current research is primarily dedicated to advancing the accuracy of camera-only 3D object detectors (apprentice) through the knowledge transferred from LiDAR- or multi-modal-based counterparts (expert). However, the domain gap between LiDAR and camera features, coupled with the inherent incompatibility in temporal fusion, significantly hinders the effectiveness of distillation-based enhancements for apprentices. Motivated by the success of uni-modal distillation, an apprentice-friendly expert model would predominantly rely on camera features, while still achieving performance comparable to multi-modal models. To this end, we introduce VCD, a framework to improve the camera-only apprentice model, including an apprentice-friendly multi-modal expert and temporal-fusion-friendly distillation supervision. The multi-modal expert VCD-E adopts the same structure as the camera-only apprentice in order to alleviate the feature disparity, and leverages LiDAR input as a depth prior to reconstruct the 3D scene, achieving performance on par with other heterogeneous multi-modal experts. Additionally, a fine-grained trajectory-based distillation module is introduced to individually rectify the motion misalignment for each object in the scene. With those improvements, our camera-only apprentice VCD-A sets a new state of the art on nuScenes with a score of 63.1% NDS. The code will be released at https://github.com/OpenDriveLab/Birds-eye-view-Perception.
Pages: 16