A survey of efficient fine-tuning methods for Vision-Language Models - Prompt and Adapter

Cited by: 1
Authors
Xing, Jialu [1]
Liu, Jianping [1, 2, 4]
Wang, Jian [3]
Sun, Lulu [1]
Chen, Xi [1]
Gu, Xunxun [1]
Wang, Yingfei [1]
Affiliations
[1] North Minzu Univ, Coll Comp Sci & Engn, Yinchuan 750021, Peoples R China
[2] North Minzu Univ, Key Lab Images & Grap Intelligent Proc, State Ethn Affairs Commiss, Yinchuan 750021, Peoples R China
[3] Chinese Acad Agr Sci, Agr Informat Inst, Beijing 100081, Peoples R China
[4] 204, Wenchang North St, Yinchuan, Ningxia, Peoples R China
Source
COMPUTERS & GRAPHICS-UK | 2024 / Vol. 119
Keywords
Vision-language; Computer vision; Efficient fine-tuning; Pre-training model; Prompt; Adapter;
DOI
10.1016/j.cag.2024.01.012
Chinese Library Classification (CLC) number
TP31 [Computer software];
Discipline classification code
081202; 0835;
Abstract
Vision-Language Models (VLMs) are a popular research area at the intersection of computer vision and natural language processing (NLP). With the emergence of transformer networks and massive web data, numerous large-scale VLMs, or Vision-Language Pre-training Models (VLPMs), have achieved state-of-the-art results in many tasks, such as retrieval (CLIP) and generation (DALL-E). Although large models have shown impressive results, the cost of retraining and full fine-tuning is prohibitive for general researchers. In recent years, efficient fine-tuning (EFT), a very low-cost tuning approach, has greatly alleviated this problem, and driven by this, a new fine-tuning paradigm has developed. Since Prompt and Adapter are the most widely used methods in the vision-language field, this review focuses on analysing progress in the application of these two methods. First, we review the VLM research paradigm based on the differences in pre-training and fine-tuning methods. Next, we categorize Prompt into 3 types (7 subtypes) of usage patterns based on the modal information involved, and Adapter into 2 types of usage patterns based on whether it plays a role in modal fusion; we further discuss both in vision and vision-language tasks. Finally, we discuss the stability and social ethics of EFT and propose possible future research directions.
Pages: 23