Text encoders bottleneck compositionality in contrastive vision-language models

Cited by: 0
Authors
Kamath, Amita [1 ]
Hessel, Jack [2 ]
Chang, Kai-Wei [1 ]
Affiliations
[1] Univ Calif Los Angeles, Los Angeles, CA 90024 USA
[2] Allen Inst AI, Seattle, WA USA
Keywords
DOI
None available
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Performant vision-language (VL) models like CLIP represent captions using a single vector. How much information about language is lost in this bottleneck? We first curate CompPrompts, a set of increasingly compositional image captions that VL models should be able to capture (e.g., from a single object, to object+property, to multiple interacting objects). Then, we train text-only recovery probes that aim to reconstruct captions from single-vector text representations produced by several VL models. This approach does not require images, allowing us to test on a broader range of scenes compared to prior work. We find that: 1) CLIP's text encoder falls short on more compositional inputs, including object relationships, attribute-object association, counting, and negations; 2) some text encoders work significantly better than others; and 3) text-only recovery performance predicts multimodal matching performance on ControlledImCaps: a new evaluation benchmark we collect and release consisting of fine-grained compositional images and captions. Specifically, our results suggest text-only recoverability is a necessary (but not sufficient) condition for modeling compositional factors in contrastive VL models. We release our datasets and code.
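The abstract's central tool is a text-only recovery probe: freeze a VL model's text encoder, embed each caption into its single vector, and train a probe to reconstruct the caption from that vector alone. The sketch below illustrates the idea with stand-ins, not the paper's actual setup: a random mean-of-token-embeddings encoder replaces CLIP's frozen text tower, and a per-word logistic-regression probe recovers a bag-of-words target rather than the full caption. All names (`encode`, `VOCAB`, the toy captions) are hypothetical.

```python
import numpy as np

# Hypothetical stand-in for a frozen VL text encoder: maps a caption to a
# single vector. (The paper probes real encoders such as CLIP's text tower.)
rng = np.random.default_rng(0)
VOCAB = ["a", "red", "cube", "left", "of", "blue", "sphere"]
D = 16  # embedding dimension
token_emb = rng.normal(size=(len(VOCAB), D))

def encode(caption: str) -> np.ndarray:
    """Single-vector caption embedding: mean of token embeddings (stand-in)."""
    ids = [VOCAB.index(w) for w in caption.split()]
    return token_emb[ids].mean(axis=0)

# Recovery probe: per-word logistic regression predicting whether each
# vocabulary word appeared in the caption -- a bag-of-words simplification
# of the paper's full caption reconstruction. The encoder stays frozen;
# only the probe weights W are trained.
captions = ["a red cube", "a blue sphere", "red sphere left of blue cube"]
X = np.stack([encode(c) for c in captions])                      # (n, D)
Y = np.array([[1.0 if w in c.split() else 0.0 for w in VOCAB]
              for c in captions])                                # (n, |V|)

W = np.zeros((D, len(VOCAB)))
for _ in range(2000):                       # plain batch gradient descent
    P = 1.0 / (1.0 + np.exp(-X @ W))        # sigmoid of probe logits
    W -= 0.5 * X.T @ (P - Y) / len(captions)

recovered = (1.0 / (1.0 + np.exp(-X @ W))) > 0.5
```

In this toy form, recovery accuracy measures how much of the caption the single vector retains; the paper's finding is that as captions grow more compositional (relations, counts, negations), real encoders' vectors lose exactly the information such probes need.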
Pages: 4933-4944
Page count: 12