Compositional Kronecker Context Optimization for vision-language models

Cited by: 0

Authors:

Ding, Kun ^{[1]}

Li, Xiaohui ^{[1]}

Yu, Qiang ^{[3]}

Wang, Ying ^{[1]}

Zhang, Haojian ^{[2]}

Xiang, Shiming ^{[1]}

Affiliations:

[1] Chinese Acad Sci, Inst Automat, State Key Lab Multimodal Artificial Intelligence S, Beijing, Peoples R China

[2] Chinese Acad Sci, Inst Automat, Engn Lab Intelligent Ind Vis, Beijing, Peoples R China

[3] Chinese Acad Sci, Inst Automat, Res Ctr Aerosp Informat, Beijing, Peoples R China

Source:

NEUROCOMPUTING | 2024 / Vol. 608

Funding:

National Natural Science Foundation of China;

Keywords:

Vision-language models; Prompt tuning; Structural context optimization; Few-shot image recognition; SPARSE REPRESENTATION; CLASSIFICATION;

DOI:

10.1016/j.neucom.2024.128421

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Context Optimization (CoOp) has emerged as a simple yet effective technique for adapting CLIP-like vision-language models to downstream image recognition tasks. Nevertheless, learning context that simultaneously achieves satisfactory base-to-new, domain, and cross-task generalization while adapting to new tasks remains a challenge. To tackle this challenge, existing methods mainly exploit knowledge distillation with auxiliary text data written by human experts. We instead explore a new technical route by structuring the prompts without resorting to extra text data. As a result, we obtain a new lightweight yet generalizable approach termed Compositional Kronecker Context Optimization (CK-CoOp). Technically, the prompt's context words in CK-CoOp are learnable vectors, which are crafted by linearly combining base vectors from a dictionary. These base vectors consist of a non-learnable component obtained by quantizing the weights in the token embedding layer, and a learnable component constructed by applying the Kronecker product to several tiny learnable matrices. Intuitively, the compositional structure mitigates the risk of overfitting on training data, and the Kronecker product breaks the non-learnable restrictions of the dictionary, thereby enhancing representation ability with minimal additional parameters. Extensive experiments confirm that, compared with existing methods, CK-CoOp not only achieves comparable or even better performance under base-to-new, domain, and cross-task generalization evaluation without the help of auxiliary text data, but also has fewer learnable parameters and fast training and inference.
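
The construction described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `build_context`, the shapes, the random stand-in for the quantized embedding dictionary, and the choice to combine the fixed and learnable components by addition are all assumptions made for illustration.

```python
from functools import reduce

import numpy as np


def build_context(coeffs, fixed_dict, factors):
    """Compose context vectors from a dictionary of base vectors.

    coeffs:     (n_ctx, n_base) combination weights (learnable in training).
    fixed_dict: (n_base, dim) non-learnable component of the dictionary.
    factors:    small matrices whose Kronecker product has shape (n_base, dim);
                this forms the learnable component.
    """
    learnable = reduce(np.kron, factors)  # Kronecker product of all factors
    if learnable.shape != fixed_dict.shape:
        raise ValueError("Kronecker product shape must match the dictionary")
    # Assumption: the two components are combined by addition.
    dictionary = fixed_dict + learnable
    # Each context vector is a linear combination of the base vectors.
    return coeffs @ dictionary


rng = np.random.default_rng(0)
n_ctx, n_base, dim = 4, 8, 16

# Hypothetical stand-in for base vectors quantized from token embeddings.
fixed_dict = rng.standard_normal((n_base, dim))
# Two tiny factors: (2, 4) kron (4, 4) gives an (8, 16) matrix.
factors = [0.01 * rng.standard_normal((2, 4)), 0.01 * rng.standard_normal((4, 4))]
coeffs = rng.standard_normal((n_ctx, n_base))

context = build_context(coeffs, fixed_dict, factors)
n_kron_params = sum(f.size for f in factors)  # parameters in the factors
n_full_params = n_base * dim                  # parameters in a dense matrix
```

In this toy setting the Kronecker factors hold 24 values, while a dense learnable matrix of the same shape would hold 128, which is the kind of parameter saving the abstract alludes to.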

Pages: 11

50 items in total

[21] DeAR: Debiasing Vision-Language Models with Additive Residuals
Seth, Ashish
Hemani, Mayur
Agarwal, Chirag
[J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 6820 - 6829
[22] Learning Domain Invariant Prompt for Vision-Language Models
Zhao, Cairong
Wang, Yubin
Jiang, Xinyang
Shen, Yifei
Song, Kaitao
Li, Dongsheng
Miao, Duoqian
[J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 1348 - 1360
[23] VinVL: Revisiting Visual Representations in Vision-Language Models
Zhang, Pengchuan
Li, Xiujun
Hu, Xiaowei
Yang, Jianwei
Zhang, Lei
Wang, Lijuan
Choi, Yejin
Gao, Jianfeng
[J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 5575 - 5584
[24] Effectiveness assessment of recent large vision-language models
Yao Jiang
Xinyu Yan
Ge-Peng Ji
Keren Fu
Meijun Sun
Huan Xiong
Deng-Ping Fan
Fahad Shahbaz Khan
[J]. Visual Intelligence, 2 (1):
[25] Towards an Exhaustive Evaluation of Vision-Language Foundation Models
Salin, Emmanuelle
Ayache, Stephane
Favre, Benoit
[J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 339 - 352
[26] On Evaluating Adversarial Robustness of Large Vision-Language Models
Zhao, Yunqing
Pang, Tianyu
Du, Chao
Yang, Xiao
Li, Chongxuan
Cheung, Ngai-Man
Lin, Min
[J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
[27] Adapting vision-language AI models to cardiology tasks
Arnaout, Rima
[J]. NATURE MEDICINE, 2024,
[28] ProVLA: Compositional Image Search with Progressive Vision-Language Alignment and Multimodal Fusion
Hu, Zhizhang
Zhu, Xinliang
Tran, Son
Vidal, Rene
Dhua, Arnab
[J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 2764 - 2769
[29] ILLUME: Rationalizing Vision-Language Models through Human Interactions
Brack, Manuel
Schramowski, Patrick
Deiseroth, Bjorn
Kersting, Kristian
[J]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 202, 2023, 202
[30] Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective
Wenhao Wu
Zhun Sun
Yuxin Song
Jingdong Wang
Wanli Ouyang
[J]. International Journal of Computer Vision, 2024, 132 (2) : 392 - 409
