Black Box Few-Shot Adaptation for Vision-Language models

Authors
Ouali, Yassine [1]
Bulat, Adrian [1]
Martinez, Brais [1]
Tzimiropoulos, Georgios [1,2]
Affiliations
[1] Samsung AI Cambridge, Cambridge, England
[2] Queen Mary University of London, London, England
Keywords
SHAPE
DOI
10.1109/ICCV51070.2023.01424
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Vision-Language (V-L) models trained with contrastive learning to align the visual and language modalities have been shown to be strong few-shot learners. Soft prompt learning is the method of choice for few-shot downstream adaptation, aiming to bridge the modality gap caused by the distribution shift induced by the new domain. While parameter-efficient, prompt learning still requires access to the model weights and can be computationally infeasible for large models with billions of parameters. To address these shortcomings, in this work we describe a black-box method for V-L few-shot adaptation that (a) operates on pre-computed image and text features and hence works without access to the model's weights, (b) is orders of magnitude faster at training time, (c) is amenable to both supervised and unsupervised training, and (d) can even be used to align image and text features computed from unimodal models. To achieve this, we propose Linear Feature Alignment (LFA), a simple linear approach for V-L re-alignment in the target domain. LFA is initialized from a closed-form solution to a least-squares problem and is then iteratively updated by minimizing a re-ranking loss. Despite its simplicity, our approach can even surpass soft-prompt learning methods, as shown by extensive experiments on 11 image and 2 video datasets. Code is available at: https://github.com/saic-fi/LFA
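To make the "closed-form least-squares initialization on pre-computed features" idea concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: it fits an unconstrained, ridge-regularized linear map W that sends L2-normalized image features to the text feature of their class, then classifies by cosine similarity. The paper's exact closed-form solution may differ (for example, it may constrain W), and the re-ranking-loss refinement stage is not shown. All function names (`fit_linear_alignment`, `predict`, `l2_normalize`) and the synthetic data are hypothetical.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale each row to unit L2 norm (V-L features are compared by cosine)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def fit_linear_alignment(img_feats, class_txt_feats, labels, reg=1e-3):
    """Hypothetical closed-form ridge least-squares map W (d x d).

    Minimises ||X W - Y||_F^2 + reg * ||W||_F^2, where row i of X is the
    normalised feature of image i and row i of Y is the normalised text
    feature of its class. Only pre-computed features are needed, so the
    encoders can stay a black box.
    """
    X = l2_normalize(img_feats)
    Y = l2_normalize(class_txt_feats)[labels]
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ Y)


def predict(img_feats, class_txt_feats, W):
    """Assign each image to the class whose text feature is most cosine-similar."""
    Z = l2_normalize(l2_normalize(img_feats) @ W)
    T = l2_normalize(class_txt_feats)
    return np.argmax(Z @ T.T, axis=1)


# Toy usage on synthetic features (hypothetical data, not from the paper):
# image features are the class text features rotated by an unknown
# orthogonal matrix plus noise, mimicking a misalignment between modalities.
rng = np.random.default_rng(0)
num_classes, dim, shots = 5, 16, 16
class_txt = l2_normalize(rng.normal(size=(num_classes, dim)))
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))


def sample(n_per_class):
    labels = np.repeat(np.arange(num_classes), n_per_class)
    feats = class_txt[labels] @ rotation + 0.05 * rng.normal(size=(len(labels), dim))
    return feats, labels


train_x, train_y = sample(shots)
test_x, test_y = sample(20)

# Without alignment (identity map) vs. with the fitted linear map.
acc_before = np.mean(predict(test_x, class_txt, np.eye(dim)) == test_y)
W = fit_linear_alignment(train_x, class_txt, train_y)
acc_after = np.mean(predict(test_x, class_txt, W) == test_y)
print(f"accuracy without alignment: {acc_before:.2f}, with alignment: {acc_after:.2f}")
```

The ridge term keeps the linear system well conditioned when there are few shots per class; in this toy setting the fitted map largely undoes the synthetic rotation, whereas the identity map leaves image and text features misaligned.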
Pages: 15488-15500
Number of pages: 13