Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models

Cited by: 10
Authors
Ma, Chengcheng [1 ,2 ]
Liu, Yang [3 ]
Deng, Jiankang [4 ]
Xie, Lingxi [4 ]
Dong, Weiming [1 ]
Xu, Changsheng [1 ]
Affiliations
[1] Chinese Acad Sci CASIA, Inst Automat, Natl Lab Pattern Recognit NLPR, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci UCAS, Sch Artificial Intelligence, Beijing 100049, Peoples R China
[3] Alibaba DAMO Acad, Hangzhou 310024, Peoples R China
[4] Huawei Inc, Shenzhen 518129, Peoples R China
Funding
Beijing Natural Science Foundation; National Science Foundation (USA);
Keywords
Vision-language model; prompt tuning; over-fitting; subspace learning; gradient projection;
DOI
10.1109/TCSVT.2023.3245584
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline Classification Codes
0808; 0809;
Abstract
Pretrained vision-language models (VLMs) such as CLIP have shown impressive generalization capability on downstream vision tasks given appropriate text prompts. Instead of designing prompts manually, Context Optimization (CoOp) has recently been proposed to learn continuous prompts from task-specific training data. Despite performance improvements on downstream tasks, several studies have reported that CoOp suffers from overfitting in two respects: (i) test accuracy on base classes first improves and then worsens during training; (ii) test accuracy on novel classes keeps decreasing. However, no existing study explains or mitigates this overfitting problem. In this study, we first explore the cause of overfitting by analyzing the gradient flow. Comparative experiments reveal that CoOp favors generalizable features in the early training stage and spurious features in the later stage, which accounts for the initial absence and subsequent onset of overfitting. Given these observations, we propose Subspace Prompt Tuning (SubPT), which projects the gradients during back-propagation onto the low-rank subspace spanned by the eigenvectors of the early-stage gradient flow throughout the entire training process, and successfully eliminates the overfitting problem. In addition, we equip CoOp with a Novel Feature Learner (NFL) to enhance the generalization of the learned prompts to novel categories beyond the training set, without requiring any image training data. Extensive experiments on 11 classification datasets demonstrate that SubPT+NFL consistently boosts the performance of CoOp and outperforms the state-of-the-art CoCoOp approach. Experiments on more challenging downstream vision tasks, including open-vocabulary object detection and zero-shot semantic segmentation, further verify the effectiveness of the proposed method. Code is available at https://tinyurl.com/mpe64f89.
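To make the projection step concrete, below is a minimal PyTorch sketch of the idea described in the abstract: fit a low-rank basis to gradients recorded in the early training stage, then project every subsequent prompt gradient onto that subspace before the optimizer update. The class name SubspaceProjector, the rank, and all shapes are hypothetical illustrations chosen here for clarity, not the authors' released implementation (see the code link above for that).

import torch

class SubspaceProjector:
    # Projects prompt gradients onto a low-rank subspace spanned by the
    # principal directions of early-stage gradients. The top right-singular
    # vectors of the stacked gradient matrix are the eigenvectors of the
    # gradient covariance, i.e., of the early-stage gradient flow.
    def __init__(self, rank: int):
        self.rank = rank
        self.basis = None  # (dim, rank), columns are the subspace basis

    def fit(self, early_grads: torch.Tensor) -> None:
        # early_grads: (num_steps, dim) matrix of flattened prompt
        # gradients recorded during the early training stage.
        _, _, vh = torch.linalg.svd(early_grads, full_matrices=False)
        self.basis = vh[: self.rank].T

    def project(self, grad: torch.Tensor) -> torch.Tensor:
        # Keep only the gradient component lying inside the subspace.
        flat = grad.reshape(-1)
        coords = self.basis.T @ flat          # coordinates in the subspace
        return (self.basis @ coords).reshape(grad.shape)

# Hypothetical usage inside a CoOp-style training loop: record flattened
# prompt gradients for the first few epochs, fit the projector once, then
# project every later gradient before optimizer.step().
prompt = torch.randn(4, 512, requires_grad=True)   # learnable context tokens
projector = SubspaceProjector(rank=8)
early = torch.randn(100, prompt.numel())           # stand-in for recorded gradients
projector.fit(early)
loss = (prompt ** 2).sum()                         # placeholder loss
loss.backward()
with torch.no_grad():
    prompt.grad = projector.project(prompt.grad)

Restricting updates to the early-stage subspace is what suppresses the spurious-feature directions that, per the abstract, dominate the later training stage.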
Pages: 4616-4629
Number of pages: 14