Modal interaction-enhanced prompt learning by transformer decoder for vision-language models

Cited by: 0
Authors
Liu, Mingyue [1 ]
Zhao, Honggang [1 ]
Ma, Longfei [1 ]
Li, Mingyong [1 ,2 ]
Affiliations
[1] Chongqing Normal Univ, Coll Comp & Informat Sci, Chongqing 401331, Peoples R China
[2] Chongqing Natl Ctr Appl Math, Chongqing 401331, Peoples R China
Keywords
Modal interaction; CLIP; Prompt learning; Self-attention
DOI
10.1007/s13735-023-00287-4
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In the current multimodal retrieval field, CoOp is a preferred approach among many models owing to its simplicity and strong adaptive capability. However, CoOp focuses primarily on optimizing prompts for contrastive learning, without considering image-text interaction or how incorporating visual information into the prompts affects the model. In this work, we propose a prompt tuning method built on CoOp that simulates image-text interaction: Decoding context optimization (DeCoOp). In extensive experiments on 11 image classification datasets, our method surpasses CoOp on seven datasets in the few-shot setting and on all 11 datasets in the zero-shot setting. Experiments on four ImageNet target datasets show a performance improvement of more than 10%, demonstrating that our approach substantially outperforms the baseline CoOp in domain generalization and robustness. In addition, ablation experiments on three representative datasets confirm the effectiveness of DeCoOp and its further accuracy gains. Finally, experiments on the 11 datasets with different visual backbones show that, across all architectures, our approach maintains a large margin over handcrafted prompts and performs better than CoOp.
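The abstract describes DeCoOp only at a high level, so the following is a purely illustrative sketch rather than the authors' implementation: the class name DecodedPromptLearner, the dimensions, and the single decoder layer are assumptions. It shows one plausible way to realize the idea named in the title and abstract, namely CoOp-style learnable context vectors refined by a transformer decoder that cross-attends to image features, so that visual information enters the prompt before text encoding.

```python
# Hypothetical sketch of a DeCoOp-style prompt learner (assumed design, not the
# paper's code): CoOp-like learnable context tokens are refined by a transformer
# decoder that cross-attends to image features from the visual encoder.
import torch
import torch.nn as nn

class DecodedPromptLearner(nn.Module):
    def __init__(self, n_ctx=16, ctx_dim=512, n_heads=8):
        super().__init__()
        # Learnable context vectors, as in CoOp.
        self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)
        # Decoder layer whose cross-attention lets prompt tokens read visual features.
        layer = nn.TransformerDecoderLayer(d_model=ctx_dim, nhead=n_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=1)

    def forward(self, image_features):
        # image_features: (batch, n_patches, ctx_dim) from a CLIP-like visual encoder.
        b = image_features.size(0)
        ctx = self.ctx.unsqueeze(0).expand(b, -1, -1)         # (b, n_ctx, ctx_dim)
        # Cross-attention injects visual information into the prompt tokens.
        return self.decoder(tgt=ctx, memory=image_features)   # (b, n_ctx, ctx_dim)

if __name__ == "__main__":
    learner = DecodedPromptLearner()
    feats = torch.randn(4, 49, 512)   # dummy patch features (e.g., 7x7 grid)
    prompts = learner(feats)
    print(prompts.shape)              # torch.Size([4, 16, 512])
```

In such a setup, the refined prompt tokens would be concatenated with class-name token embeddings and passed through the frozen CLIP text encoder, exactly as in CoOp, with only the context vectors and decoder trained.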
Pages: 10