OpenSight: A Simple Open-Vocabulary Framework for LiDAR-Based Object Detection

Authors
Zhang, Hu [1]
Ku, Jianhua [4]
Tang, Tao [5]
Sun, Haiyang [6]
Huang, Xin [2]
Huang, Zi [2]
Yu, Kaicheng [3]
Affiliations
[1] CSIRO DATA61, Sydney, NSW, Australia
[2] Univ Queensland, Brisbane, Qld, Australia
[3] Westlake Univ, Hangzhou, Peoples R China
[4] Alibaba, DAMO Acad, Beijing, Peoples R China
[5] Sun Yat Sen Univ, Shenzhen Campus, Shenzhen, Peoples R China
[6] LiAuto Inc, Beijing, Peoples R China
Keywords
OpenSight; Open-vocabulary; 3D object detection; VOXELNET;
DOI
10.1007/978-3-031-72907-2_1
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Traditional LiDAR-based object detection research primarily focuses on closed-set scenarios and therefore falls short in complex real-world applications. Directly transferring existing 2D open-vocabulary models with some known LiDAR classes to gain open-vocabulary ability, however, tends to suffer from over-fitting: the resulting model keeps detecting the known objects even when presented with a novel category. In this paper, we propose OpenSight, a more advanced 2D-3D modeling framework for LiDAR-based open-vocabulary detection. OpenSight utilizes 2D-3D geometric priors for the initial discernment and localization of generic objects, followed by a more specific semantic interpretation of the detected objects. The process begins by generating 2D boxes for generic objects from the camera images that accompany the LiDAR data. These 2D boxes, together with the LiDAR points, are then lifted back into the LiDAR space to estimate the corresponding 3D boxes. For better generic object perception, our framework integrates both temporal and spatial-aware constraints. Temporal awareness correlates the predicted 3D boxes across consecutive timestamps, recalibrating missed or inaccurate boxes. Spatial awareness randomly places some "precisely" estimated 3D boxes at varying distances, increasing the visibility of generic objects. To interpret the specific semantics of detected objects, we develop a cross-modal alignment and fusion module that first aligns 3D features with 2D image embeddings and then fuses the aligned 3D-2D features for semantic decoding. Our experiments indicate that our method establishes state-of-the-art open-vocabulary performance on widely used 3D detection benchmarks and effectively identifies objects for new categories of interest.
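The lifting step mentioned in the abstract (2D boxes plus LiDAR points giving 3D boxes) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `lift_box_to_3d`, the pinhole intrinsics `K`, the assumption that points are already expressed in the camera frame, and the use of a plain axis-aligned box with no clustering or background filtering are all illustrative assumptions.

```python
import numpy as np

def lift_box_to_3d(points, K, box_2d):
    """Hypothetical helper: estimate an axis-aligned 3D box from a 2D box.

    points : (N, 3) array of LiDAR points, assumed already transformed into
             the camera frame (x right, y down, z forward).
    K      : (3, 3) pinhole camera intrinsic matrix (assumed known).
    box_2d : (u_min, v_min, u_max, v_max) in pixel coordinates.
    Returns (min_corner, max_corner) of the enclosed points, or None.
    """
    # Keep only points in front of the camera so the projection is valid.
    pts = points[points[:, 2] > 0]

    # Project to the image plane: [u*z, v*z, z] = K @ [x, y, z].
    uvw = pts @ K.T
    uv = uvw[:, :2] / uvw[:, 2:3]

    # Select points whose projection falls inside the 2D box (the frustum).
    u_min, v_min, u_max, v_max = box_2d
    inside = (
        (uv[:, 0] >= u_min) & (uv[:, 0] <= u_max)
        & (uv[:, 1] >= v_min) & (uv[:, 1] <= v_max)
    )
    frustum = pts[inside]
    if len(frustum) == 0:
        return None

    # Simplification: a tight axis-aligned box around all frustum points.
    return frustum.min(axis=0), frustum.max(axis=0)


K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
points = np.array([[0.0, 0.0, 10.0],    # projects to (50, 50): inside
                   [0.5, 0.0, 10.0],    # projects to (55, 50): inside
                   [10.0, 0.0, 10.0],   # projects to (150, 50): outside
                   [0.0, 0.0, -5.0]])   # behind the camera: discarded
result = lift_box_to_3d(points, K, (40, 40, 60, 60))
```

A frustum selected this way usually also contains background points behind the object, so a practical pipeline would need some filtering before fitting the box; the abstract's temporal and spatial constraints address box quality at a later stage and are not modeled here.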
Pages: 1-19
Page count: 19