CREPE: Can Vision-Language Foundation Models Reason Compositionally?

Cited by: 4
Authors
Ma, Zixian [1]
Hong, Jerry [1]
Gul, Mustafa Omer [2]
Gandhi, Mona [3]
Gao, Irena [1]
Krishna, Ranjay [4]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
[2] Cornell Univ, Ithaca, NY USA
[3] Univ Penn, Philadelphia, PA USA
[4] Univ Washington, Seattle, WA USA
DOI
10.1109/CVPR52729.2023.01050
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, we find that, across 7 architectures trained with 4 algorithms on massive datasets, they struggle at compositionality. To arrive at this conclusion, we introduce a new compositionality evaluation benchmark, CREPE, which measures two important aspects of compositionality identified by cognitive science literature: systematicity and productivity. To measure systematicity, CREPE consists of a test dataset containing over 370K image-text pairs and three different seen-unseen splits. The three splits are designed to test models trained on three popular training datasets: CC-12M, YFCC-15M, and LAION-400M. We also generate 325K, 316K, and 309K hard negative captions for a subset of the pairs. To test productivity, CREPE contains 17K image-text pairs with nine different complexities plus 278K hard negative captions with atomic, swapping, and negation foils. The datasets are generated by repurposing the Visual Genome scene graphs and region descriptions and applying handcrafted templates and GPT-3. For systematicity, we find that model performance decreases consistently when novel compositions dominate the retrieval set, with Recall@1 dropping by up to 9%. For productivity, models' retrieval success decays as complexity increases, frequently nearing random chance at high complexity. These results hold regardless of model and training dataset size.
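The Recall@1 metric reported in the abstract can be illustrated with a minimal sketch: for each image, the model scores the ground-truth caption against its hard negatives, and Recall@1 is the fraction of images where the true caption receives the highest score. The function and data below are illustrative assumptions, not taken from the CREPE codebase.

```python
def recall_at_1(scores_per_image):
    """Fraction of images whose ground-truth caption outscores all hard negatives.

    scores_per_image: list of (true_score, negative_scores) pairs, where
    true_score is the model's similarity for the correct caption and
    negative_scores is a list of similarities for the hard negatives.
    """
    hits = sum(
        1
        for true_score, negatives in scores_per_image
        if all(true_score > n for n in negatives)
    )
    return hits / len(scores_per_image)


# Toy example: 3 images; in 2 of them the true caption outranks all foils.
scores = [
    (0.9, [0.2, 0.5]),  # hit
    (0.4, [0.6, 0.1]),  # miss: a hard negative scores higher
    (0.8, [0.7, 0.3]),  # hit
]
print(recall_at_1(scores))  # 2/3
```

Under this framing, the paper's productivity finding corresponds to Recall@1 approaching 1/(1 + number of foils) as caption complexity grows, i.e. random chance over the retrieval set.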
Pages: 10910-10921 (12 pages)