Robust Fine-Tuning of Vision-Language Models for Domain Generalization

Cited by: 0
Authors
Vogt-Lowell, Kevin [1 ]
Lee, Noah [2 ]
Tsiligkaridis, Theodoros [1 ]
Vaillant, Marc [2 ]
Affiliations
[1] MIT Lincoln Lab, Artificial Intelligence Technol, Lexington, MA 02421 USA
[2] MIT Lincoln Lab, Homeland Sensors & Analyt, Lexington, MA USA
Keywords
foundation model; vision-language model; CLIP; fine-tuning; distribution shift; out-of-distribution robustness;
DOI: 10.1109/HPEC58863.2023.10363450
CLC Classification: TP3 [computing technology and computer technology]
Discipline Code: 0812
Abstract
Transfer learning enables the sharing of common knowledge among models for a variety of downstream tasks, but traditional methods suffer in limited training data settings and produce narrow models incapable of effectively generalizing under distribution shifts. Foundation models have recently demonstrated impressive zero-shot inference capabilities and robustness under distribution shifts. However, zero-shot evaluation for these models has been predominantly confined to benchmarks with simple distribution shifts, limiting our understanding of their effectiveness under the more realistic shifts found in practice. Moreover, common fine-tuning methods for these models have yet to be evaluated against vision models in few-shot scenarios where training data is limited. To address these gaps, we present a new recipe for few-shot fine-tuning of the popular vision-language foundation model CLIP and evaluate its performance on challenging benchmark datasets with realistic distribution shifts from the WILDS collection. Our experimentation demonstrates that, while zero-shot CLIP fails to match the performance of trained vision models on more complex benchmarks, few-shot CLIP fine-tuning outperforms its vision-only counterparts in terms of in-distribution and out-of-distribution accuracy at all levels of training data availability. This provides a strong incentive for the adoption of foundation models within few-shot learning applications operating with real-world data.
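The abstract's core idea — classifying by comparing image embeddings against class-prompt embeddings in a shared space, then fine-tuning on a handful of labeled examples — can be sketched as follows. This is an illustrative toy, not the paper's actual recipe: the `image_encoder` and `text_encoder` below are random linear stand-ins for CLIP's real encoders, the features are synthetic, and the temperature value is an assumption borrowed from common CLIP setups.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for CLIP's encoders, which map images and class prompts
# (e.g. "a photo of a <class>") into a shared embedding space.
image_encoder = torch.nn.Linear(64, 32)  # image features -> joint space
text_encoder = torch.nn.Linear(16, 32)   # prompt features -> joint space

def class_logits(images, prompts, temperature=0.07):
    """Zero-shot-style logits: cosine similarity between each image
    embedding and each class-prompt embedding, scaled by temperature."""
    img = F.normalize(image_encoder(images), dim=-1)
    txt = F.normalize(text_encoder(prompts), dim=-1)
    return img @ txt.t() / temperature

# Few-shot setting: only a handful of labeled support examples.
images = torch.randn(8, 64)       # 8 support images (synthetic features)
prompts = torch.randn(4, 16)      # 4 class prompts (synthetic features)
labels = torch.randint(0, 4, (8,))

opt = torch.optim.SGD(
    list(image_encoder.parameters()) + list(text_encoder.parameters()),
    lr=0.1,
)

# Fine-tune both encoders with cross-entropy over the similarity logits.
loss_before = F.cross_entropy(class_logits(images, prompts), labels).item()
for _ in range(20):
    opt.zero_grad()
    loss = F.cross_entropy(class_logits(images, prompts), labels)
    loss.backward()
    opt.step()
loss_after = F.cross_entropy(class_logits(images, prompts), labels).item()
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

In practice one would load pretrained CLIP weights in place of the random linear layers; the key point the sketch shows is that the zero-shot classifier (prompt similarities) and the fine-tuning objective share the same forward pass.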
Pages: 7