Text encoders bottleneck compositionality in contrastive vision-language models

Cited by: 0

Authors
Kamath, Amita [1 ]
Hessel, Jack [2 ]
Chang, Kai-Wei [1 ]
Affiliations
[1] Univ Calif Los Angeles, Los Angeles, CA 90024 USA
[2] Allen Inst AI, Seattle, WA USA
Keywords:
DOI: Not available
CLC Number: TP18 [Artificial Intelligence Theory]
Discipline Codes: 081104; 0812; 0835; 1405
Abstract
Performant vision-language (VL) models like CLIP represent captions using a single vector. How much information about language is lost in this bottleneck? We first curate CompPrompts, a set of increasingly compositional image captions that VL models should be able to capture (e.g., single object, to object+property, to multiple interacting objects). Then, we train text-only recovery probes that aim to reconstruct captions from single-vector text representations produced by several VL models. This approach does not require images, allowing us to test on a broader range of scenes compared to prior work. We find that: 1) CLIP's text encoder falls short on more compositional inputs, including object relationships, attribute-object association, counting, and negations; 2) some text encoders work significantly better than others; and 3) text-only recovery performance predicts multimodal matching performance on ControlledImCaps, a new evaluation benchmark we collect and release consisting of fine-grained compositional images and captions. Specifically, our results suggest text-only recoverability is a necessary (but not sufficient) condition for modeling compositional factors in contrastive VL models. We release our datasets and code.
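As context for the abstract above, the following Python sketch illustrates the kind of text-only recovery probe it describes: a frozen CLIP text encoder collapses each caption into a single vector, and a small decoder is trained to regenerate the caption from that vector alone. This is a minimal illustrative sketch, not the authors' released code; the checkpoint name ("openai/clip-vit-base-patch32"), the GRU probe architecture, the optimizer settings, and the example captions are all assumptions made for demonstration.

# Minimal sketch (illustrative only, not the paper's released code) of a
# text-only recovery probe: encode captions into CLIP's single text vector,
# then train a small decoder to reconstruct the caption from that vector.
import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModelWithProjection.from_pretrained(
    "openai/clip-vit-base-patch32"
).to(device).eval()  # frozen: we only probe its single-vector output


@torch.no_grad()
def encode_captions(captions):
    """Return the pooled, projected text embeddings CLIP uses for matching."""
    batch = tokenizer(captions, padding=True, truncation=True,
                      return_tensors="pt").to(device)
    return text_encoder(**batch).text_embeds  # shape: (batch, 512)


class RecoveryProbe(nn.Module):
    """Tiny GRU decoder (an assumed probe architecture) that tries to
    regenerate the caption tokens from one caption vector."""

    def __init__(self, vocab_size, embed_dim=512, hidden=512):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, hidden)
        self.init_state = nn.Linear(embed_dim, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, caption_vec, token_ids):
        h0 = torch.tanh(self.init_state(caption_vec)).unsqueeze(0)  # (1, B, H)
        states, _ = self.gru(self.token_embed(token_ids), h0)
        return self.out(states)  # next-token logits at each position


# One teacher-forced training step on toy captions (padding handling and
# proper batching/evaluation omitted for brevity).
probe = RecoveryProbe(vocab_size=tokenizer.vocab_size).to(device)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-4)
captions = ["a red cube left of a blue sphere", "two dogs chasing one cat"]

vecs = encode_captions(captions)
batch = tokenizer(captions, padding=True, return_tensors="pt").to(device)
logits = probe(vecs, batch.input_ids[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), batch.input_ids[:, 1:].reshape(-1)
)
loss.backward()
optimizer.step()

If such a probe systematically fails to reproduce a detail (e.g., which attribute belongs to which object), that detail was lost in the single-vector bottleneck; this is the text-only recoverability the abstract refers to.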
Pages: 4933-4944
Page count: 12
Related Papers
50 records in total
  • [31] Vision-Language Models for Robot Success Detection
    Luo, Fiona
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 21, 2024: 23750-23752
  • [32] Exploring Vision-Language Models for Imbalanced Learning
    Wang, Y.; Yu, Z.; Wang, J.; Heng, Q.; Chen, H.; Ye, W.; Xie, R.; Xie, X.; Zhang, S.
    International Journal of Computer Vision, 2024, 132 (01): 224-237
  • [33] Adversarial Prompt Tuning for Vision-Language Models
    Zhang, Jiaming; Ma, Xingjun; Wang, Xin; Qiu, Lingyu; Wang, Jiaqi; Jiang, Yu-Gang; Sang, Jitao
    COMPUTER VISION - ECCV 2024, PT XLV, 2025, 15103: 56-72
  • [34] Task Residual for Tuning Vision-Language Models
    Yu, Tao; Lu, Zhihe; Jin, Xin; Chen, Zhibo; Wang, Xinchao
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023: 10899-10909
  • [35] Adventures of Trustworthy Vision-Language Models: A Survey
    Vatsa, Mayank; Jain, Anubhooti; Singh, Richa
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 20, 2024: 22650-22658
  • [36] Equivariant Similarity for Vision-Language Foundation Models
    Wang, Tan; Lin, Kevin; Li, Linjie; Lin, Chung-Ching; Yang, Zhengyuan; Zhang, Hanwang; Liu, Zicheng; Wang, Lijuan
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023: 11964-11974
  • [37] Towards Better Vision-Inspired Vision-Language Models
    Cao, Yun-Hao; Ji, Kaixiang; Huang, Ziyuan; Zheng, Chuanyang; Liu, Jiajia; Wang, Jian; Chen, Jingdong; Yang, Ming
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024: 13537-13547
  • [38] SpaceCLIP: A Vision-Language Pretraining Framework With Spatial Reconstruction On Text
    Zou, Bo; Yang, Chao; Quan, Chengbin; Zhao, Youjian
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023: 519-528
  • [39] Generative Negative Text Replay for Continual Vision-Language Pretraining
    Yan, Shipeng; Hong, Lanqing; Xu, Hang; Han, Jianhua; Tuytelaars, Tinne; Li, Zhenguo; He, Xuming
    COMPUTER VISION, ECCV 2022, PT XXXVI, 2022, 13696: 22-38
  • [40] Knowledge Boosting: Rethinking Medical Contrastive Vision-Language Pre-training
    Chen, Xiaofei; He, Yuting; Xue, Cheng; Ge, Rongjun; Li, Shuo; Yang, Guanyu
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT I, 2023, 14220: 405-415