A survey of efficient fine-tuning methods for Vision-Language Models - Prompt and Adapter

被引:1
|
作者
Xing, Jialu [1 ]
Liu, Jianping [1 ,2 ,4 ]
Wang, Jian [3 ]
Sun, Lulu [1 ]
Chen, Xi [1 ]
Gu, Xunxun [1 ]
Wang, Yingfei [1 ]
机构
[1] North Minzu Univ, Coll Comp Sci & Engn, Yinchuan 750021, Peoples R China
[2] North Minzu Univ, Key Lab Images & Grap Intelligent Proc, State Ethn Affairs Commiss, Yinchuan 750021, Peoples R China
[3] Chinese Acad Agr Sci, Agr Informat Inst, Beijing 100081, Peoples R China
[4] 204,Wenchang North St, Yinchuan, Ningxia, Peoples R China
来源
COMPUTERS & GRAPHICS-UK | 2024年 / 119卷
关键词
Vision-language; Computer vision; Efficient fine-tuning; Pre-training model; Prompt; Adapter;
D O I
10.1016/j.cag.2024.01.012
中图分类号
TP31 [计算机软件];
学科分类号
081202 ; 0835 ;
摘要
Vision Language Model (VLM) is a popular research field located at the fusion of computer vision and natural language processing (NLP). With the emergence of transformer networks and mass web data, numerous large scale VLMs or Vision -Language Pre-training Models (VLPM) have been achieving state-of-the-art results in many tasks, such as retrieval (CLIP) and generation (DALL-E). Although large models have shown impressive results, the cost of retraining and full fine-tuning is prohibitive for general researchers. In recent years, Efficient fine-tuning (EFT) which a very low-cost tuning method has been a good solution to this problem has greatly alleviated this problem, and driven by this, a new fine-tuning paradigm has developed. Since Prompt and Adapter are most widely used in the field of visual language, this review focuses on analysing the progress of the application of these two methods. Firstly, we reviewed the VLM research paradigm based on the differences in pre-training-fine-tuning methods; Next, We categorized the Prompt into 3 types (7 subtypes) of usage patterns based on the different modal information, and categorized the Adapter into 2 types of usage patterns based on whether it plays a role in modal fusion, furthermore we discussed them in vision and vision-language tasks. Finally, we discussed the stability and social ethics of EFT, and possible future research directions were proposed.
引用
收藏
页数:23
相关论文
共 50 条
  • [21] Task Residual for Tuning Vision-Language Models
    Yu, Tao
    Lu, Zhihe
    Jin, Xin
    Chen, Zhibo
    Wang, Xinchao
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 10899 - 10909
  • [22] Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models
    Shu, Manli
    Nie, Weili
    Huang, De-An
    Yu, Zhiding
    Goldstein, Tom
    Anandkumar, Anima
    Xiao, Chaowei
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [23] Debiasing vision-language models for vision tasks: a survey
    Zhu, Beier
    Zhang, Hanwang
    Frontiers of Computer Science, 2025, 19 (01)
  • [24] Black-box Prompt Tuning for Vision-Language Model as a Service
    Yu, Lang
    Chen, Qin
    Lin, Jiaju
    He, Liang
    PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 1686 - 1694
  • [25] Prompt Engineering or Fine-Tuning? A Case Study on Phishing Detection with Large Language Models
    Trad, Fouad
    Chehab, Ali
    MACHINE LEARNING AND KNOWLEDGE EXTRACTION, 2024, 6 (01): : 367 - 384
  • [26] Tuning Vision-Language Models With Multiple Prototypes Clustering
    Guo, Meng-Hao
    Zhang, Yi
    Mu, Tai-Jiang
    Huang, Sharon X.
    Hu, Shi-Min
    IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46 (12) : 11186 - 11199
  • [27] Democratizing protein language models with parameter-efficient fine-tuning
    Sledzieski, Samuel
    Kshirsagar, Meghana
    Baek, Minkyung
    Dodhia, Rahul
    Ferres, Juan Lavista
    Berger, Bonnie
    PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2024, 121 (26)
  • [28] Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
    Zong, Yongshuo
    Bohdal, Ondrej
    Yu, Tingyang
    Yang, Yongxin
    Hospedales, Timothy
    arXiv, 1600,
  • [29] Adventures of Trustworthy Vision-Language Models: A Survey
    Vatsa, Mayank
    Jain, Anubhooti
    Singh, Richa
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 20, 2024, : 22650 - 22658
  • [30] Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
    Zong, Yongshuo
    Bohdal, Ondrej
    Yu, Tingyang
    Yang, Yongxin
    Hospedales, Timothy
    Proceedings of Machine Learning Research, 2024, 235 : 62867 - 62891