Perceptual Grouping in Contrastive Vision-Language Models

Cited by: 10
Authors
Ranasinghe, Kanchana [1]
McKinzie, Brandon [1]
Ravi, Sachin [1]
Yang, Yinfei [1]
Toshev, Alexander [1]
Shlens, Jonathon [1]
Affiliations
[1] Apple, Cupertino, CA 95014 USA
DOI
10.1109/ICCV51070.2023.00513
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Recent advances in zero-shot image recognition suggest that vision-language models learn generic visual representations with a high degree of semantic information that may be arbitrarily probed with natural language phrases. Understanding an image, however, is not just about understanding what content resides within an image, but importantly, where that content resides. In this work we examine how well vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery. We demonstrate how contemporary vision and language representation learning models based on contrastive losses and large web-based data capture limited object localization information. We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information. We measure this performance in terms of zero-shot image recognition, unsupervised bottom-up and top-down semantic segmentations, as well as robustness analyses. We find that the resulting model achieves state-of-the-art results in terms of unsupervised segmentation, and demonstrate that the learned representations are uniquely robust to spurious correlations in datasets designed to probe the causal behavior of vision models.
Pages: 5548-5561
Page count: 14
Related papers
50 items in total
  • [31] Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks
    Zhang, Xinsong
    Zeng, Yan
    Zhang, Jipeng
    Li, Hang
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 551 - 568
  • [32] VinVL: Revisiting Visual Representations in Vision-Language Models
    Zhang, Pengchuan
    Li, Xiujun
    Hu, Xiaowei
    Yang, Jianwei
    Zhang, Lei
    Wang, Lijuan
    Choi, Yejin
    Gao, Jianfeng
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 5575 - 5584
  • [33] Evaluating Attribute Comprehension in Large Vision-Language Models
    Zhang, Haiwen
    Yang, Zixi
    Liu, Yuanzhi
    Wang, Xinran
    He, Zheqi
    Liang, Kongming
    Ma, Zhanyu
    PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024, 2025, 15035 : 98 - 113
  • [34] Towards an Exhaustive Evaluation of Vision-Language Foundation Models
    Salin, Emmanuelle
    Ayache, Stephane
    Favre, Benoit
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 339 - 352
  • [35] Attention Prompting on Image for Large Vision-Language Models
    Yu, Runpeng
    Yu, Weihao
    Wang, Xinchao
    COMPUTER VISION - ECCV 2024, PT XXX, 2025, 15088 : 251 - 268
  • [36] Learning with Enriched Inductive Biases for Vision-Language Models
    Yang, Lingxiao
    Zhang, Ru-Yuan
    Chen, Qi
    Xie, Xiaohua
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2025,
  • [37] Effectiveness assessment of recent large vision-language models
    Yao Jiang
    Xinyu Yan
    Ge-Peng Ji
    Keren Fu
    Meijun Sun
    Huan Xiong
    Deng-Ping Fan
    Fahad Shahbaz Khan
    Visual Intelligence, 2 (1):
  • [38] Tuning Vision-Language Models With Multiple Prototypes Clustering
    Guo, Meng-Hao
    Zhang, Yi
    Mu, Tai-Jiang
    Huang, Sharon X.
    Hu, Shi-Min
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (12) : 11186 - 11199
  • [39] uCAP: An Unsupervised Prompting Method for Vision-Language Models
    Nguyen, A. Tuan
    Tai, Kai Sheng
    Chen, Bor-Chun
    Shukla, Satya Narayan
    Yu, Hanchao
    Torr, Philip
    Tian, Tai-Peng
    Lim, Ser-Nam
    COMPUTER VISION - ECCV 2024, PT LXXIV, 2025, 15132 : 425 - 439
  • [40] Disease-Informed Adaptation of Vision-Language Models
    Zhang, Jiajin
    Wang, Ge
    Kalra, Mannudeep K.
    Yan, Pingkun
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2024, PT XI, 2024, 15011 : 232 - 242