Perceptual Grouping in Contrastive Vision-Language Models

Cited by: 10
Authors:
Ranasinghe, Kanchana [1 ]
McKinzie, Brandon [1 ]
Ravi, Sachin [1 ]
Yang, Yinfei [1 ]
Toshev, Alexander [1 ]
Shlens, Jonathon [1 ]
Affiliations:
[1] Apple, Cupertino, CA 95014 USA
DOI: 10.1109/ICCV51070.2023.00513
CLC Classification: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract:
Recent advances in zero-shot image recognition suggest that vision-language models learn generic visual representations with a high degree of semantic information that may be arbitrarily probed with natural language phrases. Understanding an image, however, is not just about understanding what content resides within it, but importantly, where that content resides. In this work we examine how well vision-language models understand where objects reside within an image and group together visually related parts of the imagery. We demonstrate that contemporary vision-language representation learning models based on contrastive losses and large web-based data capture only limited object localization information. We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information. We measure this performance in terms of zero-shot image recognition, unsupervised bottom-up and top-down semantic segmentation, and robustness analyses. We find that the resulting model achieves state-of-the-art results on unsupervised segmentation, and demonstrate that the learned representations are uniquely robust to spurious correlations in datasets designed to probe the causal behavior of vision models.
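Context for the abstract: the class of models studied here is trained with a CLIP-style symmetric image-text contrastive objective; the specific modifications the paper proposes for learning spatial information are not reproduced in this record. Below is a minimal PyTorch sketch of that standard contrastive objective only. All names (clip_contrastive_loss, image_emb, text_emb, the temperature value) are illustrative assumptions, not taken from the paper's code.

    # Minimal sketch of the CLIP-style symmetric image-text contrastive loss
    # underlying models of this class. Names are illustrative, not the paper's.
    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
        """Symmetric InfoNCE over a batch of paired image/text embeddings.

        image_emb, text_emb: (batch, dim) tensors from the two encoders.
        Matched image-caption pairs share the same batch index.
        """
        # L2-normalize so the dot product is a cosine similarity.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)

        # (batch, batch) similarity matrix, scaled by temperature.
        logits = image_emb @ text_emb.t() / temperature

        # The i-th image matches the i-th caption: targets are the diagonal.
        targets = torch.arange(logits.size(0), device=logits.device)

        # Cross-entropy in both directions (image->text and text->image).
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return (loss_i2t + loss_t2i) / 2

    # Example: random embeddings for a batch of 8 paired samples.
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    loss = clip_contrastive_loss(img, txt)

The diagonal of the similarity matrix holds the matched pairs, so each row and column reduces to an ordinary classification problem over the batch; the temperature controls how sharply mismatched pairs are penalized.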
Pages: 5548-5561
Page count: 14
Related Papers (50 total):
  • [1] Task Bias in Contrastive Vision-Language Models
    Menon, Sachit
    Chandratreya, Ishaan Preetam
    Vondrick, Carl
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024, 132 (06) : 2026 - 2040
  • [2] Text encoders bottleneck compositionality in contrastive vision-language models
    Kamath, Amita
    Hessel, Jack
    Chang, Kai-Wei
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 4933 - 4944
  • [3] Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
    Wang, Xintong
    Pan, Jingheng
    Ding, Liang
    Biemann, Chris
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 15840 - 15853
  • [4] Contrastive Region Guidance: Improving Grounding in Vision-Language Models Without Training
    Wan, David
    Cho, Jaemin
    Stengel-Eskin, Elias
    Bansal, Mohit
    COMPUTER VISION - ECCV 2024, PT LXXIX, 2025, 15137 : 198 - 215
  • [5] Vision-Language Models for Vision Tasks: A Survey
    Zhang, Jingyi
    Huang, Jiaxing
    Jin, Sheng
    Lu, Shijian
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (08) : 5625 - 5644
  • [6] Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
    Leng, Sicong
    Zhang, Hang
    Chen, Guanzheng
    Li, Xin
Lu, Shijian
    Miao, Chunyan
    Bing, Lidong
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 13872 - 13882
  • [7] Learning to Prompt for Vision-Language Models
    Zhou, Kaiyang
    Yang, Jingkang
    Loy, Chen Change
    Liu, Ziwei
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2022, 130 (09) : 2337 - 2348
  • [8] Vision-Language Models for Biomedical Applications
    Thapa, Surendrabikram
    Naseem, Usman
    Zhou, Luping
    Kim, Jinman
    PROCEEDINGS OF THE FIRST INTERNATIONAL WORKSHOP ON VISION-LANGUAGE MODELS FOR BIOMEDICAL APPLICATIONS, VLM4BIO 2024, 2024, : 1 - 2
  • [9] The Neglected Tails in Vision-Language Models
    Parashar, Shubham
    Lin, Zhiqiu
    Liu, Tian
    Dong, Xiangjue
    Li, Yanan
    Ramanan, Deva
    Caverlee, James
    Kong, Shu
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 12988 - 12997