Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection

Cited by: 0
Authors
Rasheed, Hanoona [1]
Maaz, Muhammad [1]
Khattak, Muhammad Uzair [1]
Khan, Salman [1, 2]
Khan, Fahad Shahbaz [1, 3]
Affiliations
[1] Mohamed Bin Zayed Univ, Abu Dhabi, U Arab Emirates
[2] Australian Natl Univ, Canberra, ACT, Australia
[3] Linkoping Univ, Linkoping, Sweden
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Existing open-vocabulary object detectors typically enlarge their vocabulary sizes by leveraging different forms of weak supervision, which helps them generalize to novel objects at inference. Two popular forms of weak supervision used in open-vocabulary detection (OVD) are the pretrained CLIP model and image-level supervision. We note that neither mode of supervision is optimally aligned with the detection task: CLIP is trained with image-text pairs and lacks precise localization of objects, while image-level supervision has been used with heuristics that do not accurately specify local object regions. In this work, we propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model. Furthermore, we visually ground the objects with only image-level supervision using a pseudo-labeling process that provides high-quality object proposals and helps expand the vocabulary during training. We establish a bridge between the above two object-alignment strategies via a novel weight transfer function that aggregates their complementary strengths. In essence, the proposed model seeks to minimize the gap between object-centric and image-centric representations in the OVD setting. On the COCO benchmark, our proposed approach achieves 36.6 AP50 on novel classes, an absolute gain of 8.2 over the previous best performance. On LVIS, we surpass the state-of-the-art ViLD model by 5.0 mask AP for rare categories and by 3.4 overall. Code: https://github.com/hanoonaR/object-centric-ovd.
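As a rough illustration of why CLIP-style language embeddings let a detector name classes it never saw box annotations for, the sketch below scores one region embedding against a set of class-name text embeddings by cosine similarity and applies a softmax. This is a generic open-vocabulary classification step, not the authors' implementation; the function names, the toy 3-dimensional embeddings, and the temperature value are all assumptions made for the example.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def classify_region(region_embedding, text_embeddings, temperature=0.01):
    """Return softmax probabilities over class names for one region.

    region_embedding: list of floats (hypothetical detector output).
    text_embeddings: dict mapping class name -> list of floats
        (hypothetical stand-ins for text-encoder outputs).
    temperature: assumed scaling factor for the cosine logits.
    """
    region = l2_normalize(region_embedding)
    logits = {}
    for name, text in text_embeddings.items():
        text = l2_normalize(text)
        logits[name] = sum(a * b for a, b in zip(region, text)) / temperature
    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(logits.values())
    exps = {name: math.exp(value - max_logit) for name, value in logits.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


# Toy embeddings: made-up values, not real CLIP outputs.
text_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "umbrella": [0.0, 0.0, 1.0],
}
region = [0.1, 0.2, 0.9]

probs = classify_region(region, text_embeddings)
best = max(probs, key=probs.get)
```

Because the classifier is just a table of text embeddings, extending the vocabulary at inference amounts to adding another entry to `text_embeddings`; no detector weights change in this sketch.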
Pages: 14