Compositional Kronecker Context Optimization for vision-language models

Cited by: 0
Authors
Ding, Kun [1 ]
Li, Xiaohui [1 ]
Yu, Qiang [3 ]
Wang, Ying [1 ]
Zhang, Haojian [2 ]
Xiang, Shiming [1 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, State Key Lab Multimodal Artificial Intelligence Syst, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Automat, Engn Lab Intelligent Ind Vis, Beijing, Peoples R China
[3] Chinese Acad Sci, Inst Automat, Res Ctr Aerosp Informat, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Vision-language models; Prompt tuning; Structural context optimization; Few-shot image recognition; SPARSE REPRESENTATION; CLASSIFICATION;
DOI
10.1016/j.neucom.2024.128421
CLC classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Context Optimization (CoOp) has emerged as a simple yet effective technique for adapting CLIP-like vision-language models to downstream image recognition tasks. Nevertheless, learning a context that simultaneously achieves satisfactory base-to-new, domain, and cross-task generalization while adapting to new tasks remains a challenge. To tackle this challenge, existing methods mainly exploit knowledge distillation with auxiliary text data written by human experts. We instead explore a new technical route: structuring the prompts without resorting to extra text data. The result is a lightweight yet generalizable approach termed Compositional Kronecker Context Optimization (CK-CoOp). Technically, the prompt's context words in CK-CoOp are learnable vectors crafted by linearly combining base vectors from a dictionary. These base vectors consist of a non-learnable component, obtained by quantizing the weights of the token embedding layer, and a learnable component, constructed by applying the Kronecker product to several tiny learnable matrices. Intuitively, the compositional structure mitigates the risk of overfitting the training data, while the Kronecker product lifts the non-learnable restriction on the dictionary, enhancing representation ability with minimal additional parameters. Extensive experiments confirm that, compared with existing methods, CK-CoOp achieves comparable or better performance under base-to-new, domain, and cross-task generalization evaluation without the help of auxiliary text data, while using fewer learnable parameters and offering faster training and inference.
Pages: 11
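
To make the mechanism described in the abstract concrete, the following PyTorch sketch builds prompt context vectors as linear combinations of dictionary atoms, where the dictionary is a fixed quantized part plus a Kronecker-structured learnable part. This is an illustration under stated assumptions, not the authors' implementation: the dictionary size, the quantization of the token-embedding weights (approximated here by uniform subsampling), the Kronecker factor shapes, and the initialization scales are all hypothetical choices.

import torch
import torch.nn as nn

class CKContext(nn.Module):
    """Illustrative sketch of compositional Kronecker context: num_ctx
    learnable prompt vectors, each a linear combination of dictionary atoms."""

    def __init__(self, token_embedding, num_ctx=4, dict_size=64,
                 a_rows=8, a_cols=8):
        super().__init__()
        vocab, dim = token_embedding.shape
        assert dict_size % a_rows == 0 and dim % a_cols == 0
        # Non-learnable dictionary component: the paper quantizes the token
        # embedding weights; uniform subsampling is used as a stand-in here.
        idx = torch.linspace(0, vocab - 1, dict_size).long()
        self.register_buffer("base", token_embedding[idx].clone())  # (K, D)
        # Learnable component: Kronecker product of two tiny matrices whose
        # product has shape (K, D); the factor shapes are illustrative.
        self.A = nn.Parameter(0.02 * torch.randn(a_rows, a_cols))
        self.B = nn.Parameter(0.02 * torch.randn(dict_size // a_rows,
                                                 dim // a_cols))
        # Learnable combination coefficients, one row per context token.
        self.coef = nn.Parameter(0.02 * torch.randn(num_ctx, dict_size))

    def forward(self):
        # Dictionary = fixed quantized atoms + Kronecker-structured update.
        dictionary = self.base + torch.kron(self.A, self.B)  # (K, D)
        # Context vectors are linear combinations of the dictionary atoms.
        return self.coef @ dictionary  # (num_ctx, D)

# Hypothetical usage with a CLIP text encoder's embedding table:
#   emb = clip_model.token_embedding.weight.detach()  # e.g. (49408, 512)
#   ctx = CKContext(emb)()  # (4, 512) context to prepend to class tokens

With these illustrative sizes (4 context tokens, 64 atoms, embedding dimension 512), the learnable parameters amount to 4*64 + 8*8 + 8*64 = 832, versus 4*512 = 2048 for plain CoOp's free-form context, which is consistent with the abstract's claim of fewer learnable parameters.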