Text encoders bottleneck compositionality in contrastive vision-language models

Cited by: 0
Authors
Kamath, Amita [1 ]
Hessel, Jack [2 ]
Chang, Kai-Wei [1 ]
Affiliations
[1] Univ Calif Los Angeles, Los Angeles, CA 90024 USA
[2] Allen Inst AI, Seattle, WA USA
Keywords
DOI
None available
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Performant vision-language (VL) models like CLIP represent captions using a single vector. How much information about language is lost in this bottleneck? We first curate CompPrompts, a set of increasingly compositional image captions that VL models should be able to capture (e.g., from a single object, to object+property, to multiple interacting objects). Then, we train text-only recovery probes that aim to reconstruct captions from single-vector text representations produced by several VL models. This approach does not require images, allowing us to test on a broader range of scenes compared to prior work. We find that: 1) CLIP's text encoder falls short on more compositional inputs, including object relationships, attribute-object association, counting, and negations; 2) some text encoders work significantly better than others; and 3) text-only recovery performance predicts multimodal matching performance on ControlledImCaps: a new evaluation benchmark we collect and release consisting of fine-grained compositional images and captions. Specifically, our results suggest text-only recoverability is a necessary (but not sufficient) condition for modeling compositional factors in contrastive VL models. We release our datasets and code.
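The abstract's central tool is a text-only recovery probe: freeze a VL model's text encoder, embed each caption into its single vector, and train a probe to reconstruct the caption from that vector alone. The sketch below illustrates the idea with stand-ins, not the paper's actual setup: a random mean-of-token-embeddings encoder replaces CLIP's frozen text tower, and a per-word logistic-regression probe recovers a bag-of-words target rather than the full caption. All names (`encode`, `VOCAB`, the toy captions) are hypothetical.

```python
import numpy as np

# Hypothetical stand-in for a frozen VL text encoder: maps a caption to a
# single vector. (The paper probes real encoders such as CLIP's text tower.)
rng = np.random.default_rng(0)
VOCAB = ["a", "red", "cube", "left", "of", "blue", "sphere"]
D = 16  # embedding dimension
token_emb = rng.normal(size=(len(VOCAB), D))

def encode(caption: str) -> np.ndarray:
    """Single-vector caption embedding: mean of token embeddings (stand-in)."""
    ids = [VOCAB.index(w) for w in caption.split()]
    return token_emb[ids].mean(axis=0)

# Recovery probe: per-word logistic regression predicting whether each
# vocabulary word appeared in the caption -- a bag-of-words simplification
# of the paper's full caption reconstruction. The encoder stays frozen;
# only the probe weights W are trained.
captions = ["a red cube", "a blue sphere", "red sphere left of blue cube"]
X = np.stack([encode(c) for c in captions])                      # (n, D)
Y = np.array([[1.0 if w in c.split() else 0.0 for w in VOCAB]
              for c in captions])                                # (n, |V|)

W = np.zeros((D, len(VOCAB)))
for _ in range(2000):                       # plain batch gradient descent
    P = 1.0 / (1.0 + np.exp(-X @ W))        # sigmoid of probe logits
    W -= 0.5 * X.T @ (P - Y) / len(captions)

recovered = (1.0 / (1.0 + np.exp(-X @ W))) > 0.5
```

In this toy form, recovery accuracy measures how much of the caption the single vector retains; the paper's finding is that as captions grow more compositional (relations, counts, negations), real encoders' vectors lose exactly the information such probes need.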
Pages: 4933-4944
Page count: 12