Perceptual Grouping in Contrastive Vision-Language Models

Cited by: 10

Authors:

Ranasinghe, Kanchana [1]

McKinzie, Brandon [1]

Ravi, Sachin [1]

Yang, Yinfei [1]

Toshev, Alexander [1]

Shlens, Jonathon [1]

Affiliations:

[1] Apple, Cupertino, CA 95014 USA

Source:

2023 IEEE/CVF International Conference on Computer Vision (ICCV) | 2023

DOI:

10.1109/ICCV51070.2023.00513

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Recent advances in zero-shot image recognition suggest that vision-language models learn generic visual representations with a high degree of semantic information that may be arbitrarily probed with natural language phrases. Understanding an image, however, is not just about understanding what content resides within an image, but importantly, where that content resides. In this work we examine how well vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery. We demonstrate how contemporary vision and language representation learning models based on contrastive losses and large web-based data capture limited object localization information. We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information. We measure this performance in terms of zero-shot image recognition, unsupervised bottom-up and top-down semantic segmentations, as well as robustness analyses. We find that the resulting model achieves state-of-the-art results in terms of unsupervised segmentation, and demonstrate that the learned representations are uniquely robust to spurious correlations in datasets designed to probe the causal behavior of vision models.

Citation

Pages: 5548-5561

Page count: 14
