Modal interaction-enhanced prompt learning by transformer decoder for vision-language models

Cited by: 0
Authors
Liu, Mingyue [1 ]
Zhao, Honggang [1 ]
Ma, Longfei [1 ]
Li, Mingyong [1 ,2 ]
Affiliations
[1] Chongqing Normal Univ, Coll Comp & Informat Sci, Chongqing 401331, Peoples R China
[2] Chongqing Natl Ctr Appl Math, Chongqing 401331, Peoples R China
Keywords
Modal interaction; CLIP; Prompt learning; Self-attention
DOI
10.1007/s13735-023-00287-4
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
In the current multimodal retrieval field, CoOp is a preferred approach among many models because of its simplicity and strong adaptive capability. However, CoOp focuses primarily on optimizing prompts for contrastive learning, without considering image-text interaction or the effect of incorporating visual information into the prompts. In this work, we propose a CoOp-based prompt tuning method that models image-text interaction: Decoding Context Optimization (DeCoOp). In extensive experiments on 11 image classification datasets, our method outperforms CoOp on seven datasets under the few-shot setting and on all 11 datasets under the zero-shot setting. Experiments on four ImageNet target datasets show a performance improvement of more than 10%, demonstrating that our approach substantially outperforms the CoOp baseline in domain generalization and robustness. In addition, ablation experiments on three representative datasets confirm the effectiveness of DeCoOp and its further accuracy gains. Finally, experiments on the 11 datasets with different visual backbones show that our approach outperforms handcrafted prompts by a large margin across all architectures and performs better than CoOp.
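To make the described mechanism concrete, below is a minimal sketch of the DeCoOp idea as stated in the abstract and title: CoOp-style learnable context vectors attend to image features through a transformer decoder before being concatenated with class-name embeddings. All module names, dimensions, and the fusion order are illustrative assumptions, not the authors' released implementation.

    # Minimal sketch (PyTorch), assuming a CoOp-style setup with a transformer
    # decoder injecting visual information into the learnable prompt context.
    import torch
    import torch.nn as nn

    class DecodedPromptLearner(nn.Module):
        def __init__(self, n_ctx=16, dim=512, n_heads=8, n_layers=1):
            super().__init__()
            # CoOp-style learnable context tokens shared across classes.
            self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
            # Transformer decoder: context tokens query the image features,
            # realizing the image-text (modal) interaction in the prompt.
            layer = nn.TransformerDecoderLayer(d_model=dim, nhead=n_heads,
                                               batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)

        def forward(self, image_tokens, class_embeds):
            # image_tokens: (B, L, dim) features from the image encoder
            # class_embeds: (C, n_name, dim) embedded class-name tokens
            B = image_tokens.size(0)
            C = class_embeds.size(0)
            ctx = self.ctx.unsqueeze(0).expand(B, -1, -1)            # (B, n_ctx, dim)
            ctx = self.decoder(tgt=ctx, memory=image_tokens)         # image-conditioned context
            ctx = ctx.unsqueeze(1).expand(-1, C, -1, -1)             # (B, C, n_ctx, dim)
            names = class_embeds.unsqueeze(0).expand(B, -1, -1, -1)  # (B, C, n_name, dim)
            return torch.cat([ctx, names], dim=2)                    # per-class prompts

    if __name__ == "__main__":
        learner = DecodedPromptLearner()
        img = torch.randn(2, 50, 512)     # e.g. ViT patch tokens (hypothetical shapes)
        cls = torch.randn(10, 4, 512)     # 10 classes, 4 name tokens each
        print(learner(img, cls).shape)    # torch.Size([2, 10, 20, 512])

In a full pipeline, the resulting prompts would presumably be passed through the frozen CLIP text encoder and matched to the image features by cosine similarity, as in CoOp; that stage is omitted here.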
Pages: 10