Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection

Cited by: 0

Authors:

Huang, Linyan [1]

Li, Zhiqi [2]

Sima, Chonghao [1]

Wang, Wenhai [3]

Wang, Jingdong [4]

Qiao, Yu [1]

Li, Hongyang [1]

Affiliations:

[1] Shanghai AI Lab, Shanghai, Peoples R China

[2] Nanjing Univ, Nanjing, Peoples R China

[3] CUHK, Hong Kong, Peoples R China

[4] Baidu, Beijing, Peoples R China

Source:

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023

Funding:

National Key Research and Development Program;

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Current research is primarily dedicated to advancing the accuracy of camera-only 3D object detectors (apprentice) through the knowledge transferred from LiDAR- or multi-modal-based counterparts (expert). However, the presence of the domain gap between LiDAR and camera features, coupled with the inherent incompatibility in temporal fusion, significantly hinders the effectiveness of distillation-based enhancements for apprentices. Motivated by the success of uni-modal distillation, an apprentice-friendly expert model would predominantly rely on camera features, while still achieving comparable performance to multi-modal models. To this end, we introduce VCD, a framework to improve the camera-only apprentice model, including an apprentice-friendly multi-modal expert and temporal-fusion-friendly distillation supervision. The multi-modal expert VCD-E adopts an identical structure as that of the camera-only apprentice in order to alleviate the feature disparity, and leverages LiDAR input as a depth prior to reconstruct the 3D scene, achieving the performance on par with other heterogeneous multi-modal experts. Additionally, a fine-grained trajectory-based distillation module is introduced with the purpose of individually rectifying the motion misalignment for each object in the scene. With those improvements, our camera-only apprentice VCD-A sets new state-of-the-art on nuScenes with a score of 63.1% NDS. The code will be released at https://github.com/OpenDriveLab/Birds-eye-view-Perception.
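To make the expert-to-apprentice idea in the abstract concrete, the following is a minimal sketch of a generic feature-imitation distillation loss. It is not the paper's trajectory-based module, whose details the abstract does not give. The function name `feature_distillation_loss`, the 2D grid layout of the features, and the foreground `mask` are all assumptions made for illustration.

```python
def feature_distillation_loss(expert, apprentice, mask):
    """Mean squared error between two feature grids over masked cells.

    expert, apprentice: 2D lists of floats with the same shape
        (hypothetical stand-ins for bird's-eye-view feature maps).
    mask: 2D list of 0/1 flags selecting the cells to supervise
        (an assumed foreground mask, not taken from the paper).
    """
    total, count = 0.0, 0
    for e_row, a_row, m_row in zip(expert, apprentice, mask):
        for e, a, m in zip(e_row, a_row, m_row):
            if m:
                # Penalize the apprentice for deviating from the expert.
                total += (e - a) ** 2
                count += 1
    # With no supervised cells, the loss is defined as zero.
    return total / count if count else 0.0


expert = [[1.0, 2.0], [3.0, 4.0]]
apprentice = [[1.0, 0.0], [0.0, 4.0]]
mask = [[1, 1], [0, 1]]
loss = feature_distillation_loss(expert, apprentice, mask)
```

Because the expert described in the abstract shares the apprentice's structure, a cell-by-cell comparison of this kind is at least well defined; how VCD actually selects and aligns the supervised regions is not specified here.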


Pages: 16
