Compositional Kronecker Context Optimization for vision-language models

Times Cited: 0
Authors
Ding, Kun [1 ]
Li, Xiaohui [1 ]
Yu, Qiang [3 ]
Wang, Ying [1 ]
Zhang, Haojian [2 ]
Xiang, Shiming [1 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, State Key Lab Multimodal Artificial Intelligence Syst, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Automat, Engn Lab Intelligent Ind Vis, Beijing, Peoples R China
[3] Chinese Acad Sci, Inst Automat, Res Ctr Aerosp Informat, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Vision-language models; Prompt tuning; Structural context optimization; Few-shot image recognition; SPARSE REPRESENTATION; CLASSIFICATION;
DOI
10.1016/j.neucom.2024.128421
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Context Optimization (CoOp) has emerged as a simple yet effective technique for adapting CLIP-like vision-language models to downstream image recognition tasks. Nevertheless, learning context that simultaneously achieves satisfactory base-to-new, domain, and cross-task generalization while adapting to new tasks remains a challenge. To tackle this challenge, existing methods mainly exploit knowledge distillation with auxiliary text data written by human experts. We instead explore a new technical route that structures the prompts without resorting to extra text data. As a result, we obtain a lightweight yet generalizable approach termed Compositional Kronecker Context Optimization (CK-CoOp). Technically, the prompt's context words in CK-CoOp are learnable vectors crafted by linearly combining base vectors from a dictionary. These base vectors consist of a non-learnable component, obtained by quantizing the weights of the token embedding layer, and a learnable component, constructed by applying the Kronecker product to several tiny learnable matrices. Intuitively, the compositional structure mitigates the risk of overfitting to the training data, while the Kronecker product relaxes the non-learnable restriction of the dictionary, enhancing representation ability with minimal additional parameters. Extensive experiments confirm that, compared with existing methods, CK-CoOp achieves comparable or even better performance under base-to-new, domain, and cross-task generalization evaluation without the help of auxiliary text data, while also having fewer learnable parameters and faster training and inference.
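To make the construction concrete, below is a minimal PyTorch sketch of the dictionary-plus-Kronecker context the abstract describes. The class name, dictionary size, factor shapes, and the row-subsampling stand-in for embedding quantization are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class CKContextSketch(nn.Module):
    """Sketch of CK-CoOp-style context vectors (assumed hyper-parameters).

    Each context word is a linear combination of dictionary atoms; the
    dictionary is a frozen, quantized slice of the token embedding plus a
    learnable Kronecker-structured correction.
    """

    def __init__(self, token_embedding, n_ctx=4, n_base=64, d_model=512):
        super().__init__()
        assert token_embedding.shape[1] == d_model
        # Non-learnable component: stand-in for quantizing the token
        # embedding weights (here: random row subsampling; the paper's
        # actual quantization scheme may differ).
        idx = torch.randperm(token_embedding.shape[0])[:n_base]
        self.register_buffer("base_fixed", token_embedding[idx].clone())

        # Learnable component: the Kronecker product of two tiny matrices
        # yields a full (n_base, d_model) = (64, 512) term from only
        # 8*8 + 8*64 parameters (factor shapes are assumptions).
        self.A = nn.Parameter(torch.zeros(8, 8))          # (r1, c1)
        self.B = nn.Parameter(0.02 * torch.randn(8, 64))  # (r2, c2)

        # Combination coefficients: one row of weights per context word.
        self.coef = nn.Parameter(0.02 * torch.randn(n_ctx, n_base))

    def forward(self):
        # kron((8, 8), (8, 64)) -> (64, 512), matching the fixed dictionary.
        dictionary = self.base_fixed + torch.kron(self.A, self.B)
        # (n_ctx, n_base) @ (n_base, d_model) -> (n_ctx, d_model) context.
        return self.coef @ dictionary


# Usage sketch: a CLIP-sized vocabulary (49408 tokens, 512-dim embeddings);
# the resulting (4, 512) context would be prepended to class-name tokens
# before the text encoder (encoder wiring omitted).
ctx = CKContextSketch(torch.randn(49408, 512))
print(ctx().shape)  # torch.Size([4, 512])
```

Note that initializing A to zeros makes the Kronecker correction start at zero, so training begins from the purely quantized dictionary and only gradually departs from it.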
Pages: 11