A survey of efficient fine-tuning methods for Vision-Language Models - Prompt and Adapter

Cited by: 1
Authors
Xing, Jialu [1]
Liu, Jianping [1,2,4]
Wang, Jian [3]
Sun, Lulu [1]
Chen, Xi [1]
Gu, Xunxun [1]
Wang, Yingfei [1]
Affiliations
[1] North Minzu Univ, Coll Comp Sci & Engn, Yinchuan 750021, Peoples R China
[2] North Minzu Univ, Key Lab Images & Grap Intelligent Proc, State Ethn Affairs Commiss, Yinchuan 750021, Peoples R China
[3] Chinese Acad Agr Sci, Agr Informat Inst, Beijing 100081, Peoples R China
[4] 204,Wenchang North St, Yinchuan, Ningxia, Peoples R China
Source
COMPUTERS & GRAPHICS-UK | 2024 / Vol. 119
Keywords
Vision-language; Computer vision; Efficient fine-tuning; Pre-training model; Prompt; Adapter
DOI
10.1016/j.cag.2024.01.012
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Vision-Language Models (VLMs) are a popular research area at the intersection of computer vision and natural language processing (NLP). With the emergence of transformer networks and massive web data, numerous large-scale VLMs, or Vision-Language Pre-training Models (VLPMs), have achieved state-of-the-art results in many tasks, such as retrieval (CLIP) and generation (DALL-E). Although large models have shown impressive results, the cost of retraining and full fine-tuning is prohibitive for general researchers. In recent years, efficient fine-tuning (EFT), a very low-cost tuning method, has greatly alleviated this problem and has driven the development of a new fine-tuning paradigm. Since Prompt and Adapter are the most widely used methods in the vision-language field, this review focuses on analysing the progress in applying these two methods. First, we review the VLM research paradigm based on the differences in pre-training and fine-tuning methods. Next, we categorize Prompt into 3 types (7 subtypes) of usage patterns based on the modal information involved, and Adapter into 2 types of usage patterns based on whether it plays a role in modal fusion; we further discuss both in vision and vision-language tasks. Finally, we discuss the stability and social ethics of EFT and propose possible future research directions.
Pages: 23
Related papers
50 in total
  • [31] CLIP-Adapter: Better Vision-Language Models with Feature Adapters
    Gao, Peng
    Geng, Shijie
    Zhang, Renrui
    Ma, Teli
    Fang, Rongyao
    Zhang, Yongfeng
    Li, Hongsheng
    Qiao, Yu
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024, 132 (02) : 581 - 595
  • [32] CLIP-Adapter: Better Vision-Language Models with Feature Adapters
    Peng Gao
    Shijie Geng
    Renrui Zhang
    Teli Ma
    Rongyao Fang
    Yongfeng Zhang
    Hongsheng Li
    Yu Qiao
    International Journal of Computer Vision, 2024, 132 (2) : 581 - 595
  • [33] Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model
    Xing, Yinghui
    Wu, Qirui
    Cheng, De
    Zhang, Shizhou
    Liang, Guoqiang
    Wang, Peng
    Zhang, Yanning
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 2056 - 2068
  • [34] Weak Distribution Detectors Lead to Stronger Generalizability of Vision-Language Prompt Tuning
    Ding, Kun
    Zhang, Haojian
    Yu, Qiang
    Wang, Ying
    Xiang, Shiming
    Pan, Chunhong
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 2, 2024, : 1528 - 1536
  • [35] Fine-tuning and prompt engineering for large language models-based code review automation
    Pornprasit, Chanathip
    Tantithamthavorn, Chakkrit
    INFORMATION AND SOFTWARE TECHNOLOGY, 2024, 175
  • [36] Constraint embedding for prompt tuning in vision-language pre-trained model
    Keyang Cheng
    Liutao Wei
    Jingfeng Tang
    Yongzhao Zhan
    Multimedia Systems, 2025, 31 (1)
  • [37] Concept-Guided Prompt Learning for Generalization in Vision-Language Models
    Zhang, Yi
    Zhang, Ce
    Yu, Ke
    Tang, Yushun
    He, Zhihai
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 7, 2024, : 7377 - 7386
  • [38] Learning Hierarchical Prompt with Structured Linguistic Knowledge for Vision-Language Models
    Wang, Yubin
    Jiang, Xinyang
    Cheng, De
    Li, Dongsheng
    Zhao, Cairong
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 6, 2024, : 5749 - 5757
  • [39] SwapPrompt: Test-Time Prompt Adaptation for Vision-Language Models
    Ma, Xiaosong
    Zhang, Jie
    Guo, Song
    Xu, Wenchao
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [40] Personalized Large Language Models through Parameter Efficient Fine-Tuning Techniques
    Braga, Marco
    PROCEEDINGS OF THE 47TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2024, 2024, : 3076 - 3076