Modal interaction-enhanced prompt learning by transformer decoder for vision-language models

Cited by: 0
Authors
Liu, Mingyue [1 ]
Zhao, Honggang [1 ]
Ma, Longfei [1 ]
Li, Mingyong [1 ,2 ]
Affiliations
[1] Chongqing Normal Univ, Coll Comp & Informat Sci, Chongqing 401331, Peoples R China
[2] Chongqing Natl Ctr Appl Math, Chongqing 401331, Peoples R China
Keywords
Modal interaction; CLIP; Prompt learning; Self-attention;
DOI
10.1007/s13735-023-00287-4
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In the current multimodal retrieval field, CoOp is a preferred approach among many models due to its simplicity and strong adaptive capability. However, CoOp focuses primarily on optimizing prompts for contrastive learning, without considering image-text interaction or the effect of incorporating visual information into the prompts. In this work, we propose a CoOp-based prompt tuning method that simulates image-text interaction: Decoding Context Optimization (DeCoOp). In extensive experiments on 11 image classification datasets, our method outperforms CoOp on seven datasets in the few-shot setting and on all 11 datasets in the zero-shot setting. Experiments on four ImageNet target datasets show a performance improvement of more than 10%, demonstrating that our approach substantially outperforms the baseline CoOp in domain generalization and robustness. In addition, ablation experiments on three representative datasets confirm the effectiveness of DeCoOp and its further accuracy gains. Finally, experiments on the 11 datasets with different visual backbones show that the gap between our approach and handcrafted prompts is large across all architectures, and that our method outperforms CoOp.
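The abstract's core idea — CoOp-style learnable context vectors refined by a decoder step that cross-attends to image features before CLIP-style matching — can be sketched roughly. The sketch below is a minimal illustration built from the abstract alone; all names, dimensions, and the pooling scheme are illustrative assumptions, not the authors' actual DeCoOp implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax for the attention weights.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d):
    # Scaled dot-product attention: prompt tokens (queries) attend to
    # image patch features (keys/values), simulating image-text interaction.
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

rng = np.random.default_rng(0)
d = 8          # embedding dim (illustrative; CLIP uses 512 or 768)
n_ctx = 4      # number of learnable context tokens, as in CoOp
n_patch = 16   # number of image patch features from the vision encoder

ctx = rng.normal(size=(n_ctx, d))            # learnable context prompts
img_patches = rng.normal(size=(n_patch, d))  # visual features for one image

# Decoder-style step: context tokens absorb visual information via a
# residual cross-attention update before meeting the class-name tokens.
ctx_refined = ctx + cross_attention(ctx, img_patches, d)

class_emb = rng.normal(size=(3, d))          # e.g. "cat", "dog", "car" tokens
# Toy text "encoding": pool the refined context with each class embedding.
text_feat = ctx_refined.mean(axis=0) + class_emb          # shape (3, d)
img_feat = img_patches.mean(axis=0)                        # shape (d,)

# CLIP-style cosine-similarity logits, one score per class.
logits = (text_feat @ img_feat) / (
    np.linalg.norm(text_feat, axis=1) * np.linalg.norm(img_feat))
```

In training, `ctx` (and the decoder weights) would be the only learned parameters, optimized with a contrastive loss while the CLIP encoders stay frozen — the point of contrast with plain CoOp is the cross-attention update, which lets the prompts vary with the input image.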
Pages: 10
Related Papers
50 items
  • [41] Transformer vision-language tracking via proxy token guided cross-modal fusion
    Zhao, Haojie
    Wang, Xiao
    Wang, Dong
    Lu, Huchuan
    Ruan, Xiang
    PATTERN RECOGNITION LETTERS, 2023, 168 : 10 - 16
  • [42] VISION-LANGUAGE MODELS AS SUCCESS DETECTORS
    Du, Yuqing
    Konyushkova, Ksenia
    Denil, Misha
    Raju, Akhil
    Landon, Jessica
    Hill, Felix
    de Freitas, Nando
    Cabi, Serkan
    CONFERENCE ON LIFELONG LEARNING AGENTS, VOL 232, 2023, 232 : 120 - 136
  • [43] MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models
    Monajatipoor, Masoud
    Li, Liunian Harold
    Rouhsedaghat, Mozhdeh
    Yang, Lin F.
    Chang, Kai-Wei
61ST CONFERENCE OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 2, 2023, : 495 - 508
  • [44] Pre-training A Prompt Pool for Vision-Language Model
    Liu, Jun
    Gu, Yang
    Yang, Zhaohua
    Guo, Shuai
    Liu, Huaqiu
    Chen, Yiqiang
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [45] Vision-Language Transformer and Query Generation for Referring Segmentation
    Ding, Henghui
    Liu, Chang
    Wang, Suchen
    Jiang, Xudong
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 16301 - 16310
  • [46] Distilling Vision-Language Foundation Models: A Data-Free Approach via Prompt Diversification
    Xuan, Yunyi
    Chen, Weijie
    Yang, Shicai
    Xie, Di
    Lin, Luojun
    Zhuang, Yueting
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4928 - 4938
  • [47] Debiasing vision-language models for vision tasks: a survey
    Zhu, Beier
    Zhang, Hanwang
FRONTIERS OF COMPUTER SCIENCE, 2025, 19 (01)
  • [48] Multiple Prompt Fusion for Zero-Shot Lesion Detection Using Vision-Language Models
    Guo, Miaotian
    Yi, Huahui
    Qin, Ziyuan
    Wang, Haiying
    Men, Aidong
    Lao, Qicheng
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT V, 2023, 14224 : 283 - 292
  • [49] Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models
    Shu, Manli
    Nie, Weili
    Huang, De-An
    Yu, Zhiding
    Goldstein, Tom
    Anandkumar, Anima
    Xiao, Chaowei
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [50] LiFT: Transfer Learning in Vision-Language Models for Downstream Adaptation and Generalization
    Li, Jingzheng
    Sun, Hailong
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4678 - 4687