CREPE: Can Vision-Language Foundation Models Reason Compositionally?

Cited by: 4
Authors
Ma, Zixian [1 ]
Hong, Jerry [1 ]
Gul, Mustafa Omer [2 ]
Gandhi, Mona [3 ]
Gao, Irena [1 ]
Krishna, Ranjay [4 ]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
[2] Cornell Univ, Ithaca, NY USA
[3] Univ Penn, Philadelphia, PA USA
[4] Univ Washington, Seattle, WA USA
DOI
10.1109/CVPR52729.2023.01050
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, we find that, across 7 architectures trained with 4 algorithms on massive datasets, these models struggle at compositionality. To arrive at this conclusion, we introduce a new compositionality evaluation benchmark, CREPE, which measures two important aspects of compositionality identified by the cognitive science literature: systematicity and productivity. To measure systematicity, CREPE consists of a test dataset containing over 370K image-text pairs and three different seen-unseen splits. The three splits are designed to test models trained on three popular training datasets: CC-12M, YFCC-15M, and LAION-400M. We also generate 325K, 316K, and 309K hard negative captions for a subset of the pairs. To test productivity, CREPE contains 17K image-text pairs with nine different complexities, plus 278K hard negative captions with atomic, swapping, and negation foils. The datasets are generated by repurposing the Visual Genome scene graphs and region descriptions and applying handcrafted templates and GPT-3. For systematicity, we find that model performance decreases consistently when novel compositions dominate the retrieval set, with Recall@1 dropping by up to 9%. For productivity, models' retrieval success decays as complexity increases, frequently nearing random chance at high complexity. These results hold regardless of model and training dataset size.
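As a concrete illustration of the evaluation protocol the abstract describes, each image is queried against a retrieval set containing its ground-truth caption plus generated hard negatives, and performance is reported as Recall@1. The sketch below is a minimal illustration, not the authors' released code: it assumes image and caption embeddings have already been produced by some pretrained vision-language model (e.g., CLIP) and uses random vectors as stand-ins so it runs on its own.

    import numpy as np

    def recall_at_1(image_emb, caption_embs, true_idx=0):
        # Image-to-text Recall@1 for a single query image.
        # image_emb:    (d,) embedding of the query image.
        # caption_embs: (n, d) embeddings of the ground-truth caption plus
        #               its hard negatives (e.g., atomic/swap/negation foils).
        # Returns 1.0 if the ground-truth caption is ranked first under
        # cosine similarity, else 0.0.
        img = image_emb / np.linalg.norm(image_emb)
        caps = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
        sims = caps @ img  # one cosine similarity per candidate caption
        return float(np.argmax(sims) == true_idx)

    # Toy usage with random stand-in embeddings: one ground-truth caption
    # plus five hard negatives in a 512-dimensional space. A real run would
    # average this score over all image-text pairs in the benchmark.
    rng = np.random.default_rng(0)
    image = rng.normal(size=512)
    captions = rng.normal(size=(6, 512))
    print(recall_at_1(image, captions))  # 1.0 or 0.0 for this single query

Under this protocol, the systematicity result corresponds to Recall@1 dropping when the retrieval set is dominated by compositions unseen during pretraining, and the productivity result corresponds to the score approaching chance (1/n for a retrieval set of n captions) as caption complexity grows.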
Pages: 10910-10921
Page count: 12
Related papers (50 in total)
  • [1] Equivariant Similarity for Vision-Language Foundation Models
    Wang, Tan
    Lin, Kevin
    Li, Linjie
    Lin, Chung-Ching
    Yang, Zhengyuan
    Zhang, Hanwang
    Liu, Zicheng
    Wang, Lijuan
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 11964 - 11974
  • [2] Towards an Exhaustive Evaluation of Vision-Language Foundation Models
    Salin, Emmanuelle
    Ayache, Stephane
    Favre, Benoit
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 339 - 352
  • [3] Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification
    Peng, Wenshuo
    Zhang, Kaipeng
    Yang, Yue
    Zhang, Hao
    Qiao, Yu
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 5, 2024, : 4506 - 4514
  • [4] Vision-Language Models for Vision Tasks: A Survey
    Zhang, Jingyi
    Huang, Jiaxing
    Jin, Sheng
    Lu, Shijian
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (08) : 5625 - 5644
  • [5] Distilling Out-of-Distribution Robustness from Vision-Language Foundation Models
    Zhou, Andy
    Wang, Jindong
    Wang, Yu-Xiong
    Wang, Haohan
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [6] Learning to Prompt for Vision-Language Models
    Zhou, Kaiyang
    Yang, Jingkang
    Loy, Chen Change
    Liu, Ziwei
    [J]. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2022, 130 (09) : 2337 - 2348
  • [7] Vision-Language Models as Success Detectors
    Du, Yuqing
    Konyushkova, Ksenia
    Denil, Misha
    Raju, Akhil
    Landon, Jessica
    Hill, Felix
    de Freitas, Nando
    Cabi, Serkan
    [J]. CONFERENCE ON LIFELONG LEARNING AGENTS, VOL 232, 2023, 232 : 120 - 136
  • [8] Debiasing Vision-Language Models for Vision Tasks: A Survey
    Zhu, Beier
    Zhang, Hanwang
    [J]. FRONTIERS OF COMPUTER SCIENCE, 2025, 19 (01)
  • [9] Conditional Prompt Learning for Vision-Language Models
    Zhou, Kaiyang
    Yang, Jingkang
    Loy, Chen Change
    Liu, Ziwei
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 16795 - 16804