LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction

Cited by: 0

Authors:

Du, Penghui [1,2,3]

Wang, Yu [2]

Sung, Yifan [2]

Wang, Luting [1]

Li, Yue [1]

Zhang, Gang [2]

Ding, Errui [2]

Wang, Yan [3]

Wang, Jingdong [2]

Liu, Si [1]

Affiliations:

[1] Beihang Univ, Beijing, Peoples R China

[2] Baidu, Beijing, Peoples R China

[3] Tsinghua Univ, AIR, Beijing, Peoples R China

Source:

COMPUTER VISION - ECCV 2024, PT XXIII | 2025 / Vol. 15081

Funding:

Beijing Natural Science Foundation; National Natural Science Foundation of China

Keywords:

Inter-category Relationships; Language Model; DETR

DOI:

10.1007/978-3-031-73337-6_18

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Existing methods enhance open-vocabulary object detection by leveraging the robust open-vocabulary recognition capabilities of Vision-Language Models (VLMs), such as CLIP. However, two main challenges emerge: (1) A deficiency in concept representation, where the category names in CLIP's text space lack textual and visual knowledge. (2) An overfitting tendency towards base categories, with the open vocabulary knowledge biased towards base categories during the transfer from VLMs to detectors. To address these challenges, we propose the Language Model Instruction (LaMI) strategy, which leverages the relationships between visual concepts and applies them within a simple yet effective DETR-like detector, termed LaMI-DETR. LaMI utilizes GPT to construct visual concepts and employs T5 to investigate visual similarities across categories. These inter-category relationships refine concept representation and avoid overfitting to base categories. Comprehensive experiments validate our approach's superior performance over existing methods in the same rigorous setting without reliance on external training resources. LaMI-DETR achieves a rare box AP of 43.4 on OV-LVIS, surpassing the previous best by 7.8 rare box AP.

Pages: 312-328

Page count: 17
