Text encoders bottleneck compositionality in contrastive vision-language models

Cited by: 0
Authors:
Kamath, Amita [1 ]
Hessel, Jack [2 ]
Chang, Kai-Wei [1 ]
Affiliations:
[1] Univ Calif Los Angeles, Los Angeles, CA 90024 USA
[2] Allen Inst AI, Seattle, WA USA
Keywords: (none listed)
DOI: not available
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract:
Performant vision-language (VL) models like CLIP represent captions using a single vector. How much information about language is lost in this bottleneck? We first curate CompPrompts, a set of increasingly compositional image captions that VL models should be able to capture (e.g., single object, to object+property, to multiple interacting objects). Then, we train text-only recovery probes that aim to reconstruct captions from single-vector text representations produced by several VL models. This approach does not require images, allowing us to test on a broader range of scenes compared to prior work. We find that: 1) CLIP's text encoder falls short on more compositional inputs, including object relationships, attribute-object association, counting, and negations; 2) some text encoders work significantly better than others; and 3) text-only recovery performance predicts multimodal matching performance on ControlledImCaps: a new evaluation benchmark we collect and release consisting of fine-grained compositional images and captions. Specifically, our results suggest text-only recoverability is a necessary (but not sufficient) condition for modeling compositional factors in contrastive VL models. We release our datasets and code.
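The recovery-probe setup described in the abstract is easy to picture in code. Below is a minimal sketch (our illustration, not the authors' released code) of one plausible implementation: CLIP's text encoder is frozen, its single pooled caption vector is projected through a small linear bridge into a GPT-2 prefix embedding, and GPT-2 is trained to reconstruct the caption from that prefix alone. The model checkpoints, the linear bridge, and the prefix-conditioning scheme are all assumptions made for illustration.

```python
# Sketch of a text-only recovery probe (an assumed setup, not the paper's
# released code): freeze CLIP's text encoder, project its single pooled
# vector into a one-token GPT-2 prefix, and train GPT-2 to reconstruct
# the original caption conditioned only on that prefix.
import torch
import torch.nn as nn
from transformers import CLIPTextModel, CLIPTokenizer, GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip_txt = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
gpt2_tok = GPT2Tokenizer.from_pretrained("gpt2")
gpt2_tok.pad_token = gpt2_tok.eos_token
decoder = GPT2LMHeadModel.from_pretrained("gpt2").to(device)

# Trainable bridge: CLIP's 512-d pooled text vector -> one GPT-2 prefix embedding.
bridge = nn.Linear(clip_txt.config.hidden_size, decoder.config.n_embd).to(device)
optim = torch.optim.AdamW(list(bridge.parameters()) + list(decoder.parameters()), lr=1e-5)

def recovery_loss(captions):
    # 1) The single-vector bottleneck: pooled CLIP text embedding, encoder frozen.
    with torch.no_grad():
        enc = clip_tok(captions, padding=True, truncation=True, return_tensors="pt").to(device)
        pooled = clip_txt(**enc).pooler_output                 # (B, 512)
    prefix = bridge(pooled).unsqueeze(1)                       # (B, 1, n_embd)

    # 2) The probe: GPT-2 must reconstruct the caption from the prefix alone.
    dec = gpt2_tok(captions, padding=True, return_tensors="pt").to(device)
    tok_emb = decoder.transformer.wte(dec.input_ids)           # (B, T, n_embd)
    inputs = torch.cat([prefix, tok_emb], dim=1)               # (B, T+1, n_embd)
    # No loss on the prefix position (-100); ignore padding tokens too.
    labels = torch.cat(
        [torch.full((dec.input_ids.size(0), 1), -100, device=device),
         dec.input_ids.masked_fill(dec.attention_mask == 0, -100)], dim=1)
    attn = torch.cat([torch.ones_like(labels[:, :1]), dec.attention_mask], dim=1)
    return decoder(inputs_embeds=inputs, attention_mask=attn, labels=labels).loss

# One training step on a compositional probe input (counting + interaction).
optim.zero_grad()
loss = recovery_loss(["two dogs chasing one cat"])
loss.backward()
optim.step()
```

The logic of the paper's necessity claim follows from this picture: if a property such as word order, attribute binding, or count cannot be decoded from the pooled vector by any probe of this form, the contrastive matching head, which sees only that same vector, cannot use it either.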
Pages: 4933-4944 (12 pages)