Perceptual Grouping in Contrastive Vision-Language Models

Cited by: 10
Authors:
Ranasinghe, Kanchana [1 ]
McKinzie, Brandon [1 ]
Ravi, Sachin [1 ]
Yang, Yinfei [1 ]
Toshev, Alexander [1 ]
Shlens, Jonathon [1 ]
Affiliations:
[1] Apple, Cupertino, CA 95014 USA
DOI: 10.1109/ICCV51070.2023.00513
CLC Classification: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract:
Recent advances in zero-shot image recognition suggest that vision-language models learn generic visual representations with a high degree of semantic information that may be arbitrarily probed with natural language phrases. Understanding an image, however, is not just about understanding what content resides within it, but importantly, where that content resides. In this work we examine how well vision-language models understand where objects reside within an image and group together visually related parts of the imagery. We demonstrate that contemporary vision-language representation learning models based on contrastive losses and large web-based data capture only limited object localization information. We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information. We measure this performance in terms of zero-shot image recognition, unsupervised bottom-up and top-down semantic segmentation, and robustness analyses. We find that the resulting model achieves state-of-the-art results on unsupervised segmentation, and demonstrate that the learned representations are uniquely robust to spurious correlations in datasets designed to probe the causal behavior of vision models.
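Context for the abstract: the class of models studied here is trained with a CLIP-style symmetric image-text contrastive objective; the specific modifications the paper proposes for learning spatial information are not reproduced in this record. Below is a minimal PyTorch sketch of that standard contrastive objective only. All names (clip_contrastive_loss, image_emb, text_emb, the temperature value) are illustrative assumptions, not taken from the paper's code.

    # Minimal sketch of the CLIP-style symmetric image-text contrastive loss
    # underlying models of this class. Names are illustrative, not the paper's.
    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
        """Symmetric InfoNCE over a batch of paired image/text embeddings.

        image_emb, text_emb: (batch, dim) tensors from the two encoders.
        Matched image-caption pairs share the same batch index.
        """
        # L2-normalize so the dot product is a cosine similarity.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)

        # (batch, batch) similarity matrix, scaled by temperature.
        logits = image_emb @ text_emb.t() / temperature

        # The i-th image matches the i-th caption: targets are the diagonal.
        targets = torch.arange(logits.size(0), device=logits.device)

        # Cross-entropy in both directions (image->text and text->image).
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return (loss_i2t + loss_t2i) / 2

    # Example: random embeddings for a batch of 8 paired samples.
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    loss = clip_contrastive_loss(img, txt)

The diagonal of the similarity matrix holds the matched pairs, so each row and column reduces to an ordinary classification problem over the batch; the temperature controls how sharply mismatched pairs are penalized.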
Pages: 5548-5561
Page count: 14
Related Papers (50 total):
  • [1] Task Bias in Contrastive Vision-Language Models
    Menon, Sachit
    Chandratreya, Ishaan Preetam
    Vondrick, Carl
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024, 132 (06) : 2026 - 2040
  • [2] Text encoders bottleneck compositionality in contrastive vision-language models
    Kamath, Amita
    Hessel, Jack
    Chang, Kai-Wei
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 4933 - 4944
  • [3] Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
    Wang, Xintong
    Pan, Jingheng
    Ding, Liang
    Biemann, Chris
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 15840 - 15853
  • [4] Contrastive Region Guidance: Improving Grounding in Vision-Language Models Without Training
    Wan, David
    Cho, Jaemin
    Stengel-Eskin, Elias
    Bansal, Mohit
    COMPUTER VISION - ECCV 2024, PT LXXIX, 2025, 15137 : 198 - 215
  • [5] Vision-Language Models for Vision Tasks: A Survey
    Zhang, Jingyi
    Huang, Jiaxing
    Jin, Sheng
    Lu, Shijian
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (08) : 5625 - 5644
  • [6] Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
    Leng, Sicong
    Zhang, Hang
    Chen, Guanzheng
    Li, Xin
Lu, Shijian
    Miao, Chunyan
    Bing, Lidong
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 13872 - 13882
  • [7] Learning to Prompt for Vision-Language Models
    Zhou, Kaiyang
    Yang, Jingkang
    Loy, Chen Change
    Liu, Ziwei
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2022, 130 (09) : 2337 - 2348
  • [8] Vision-Language Models for Biomedical Applications
    Thapa, Surendrabikram
    Naseem, Usman
    Zhou, Luping
    Kim, Jinman
    PROCEEDINGS OF THE FIRST INTERNATIONAL WORKSHOP ON VISION-LANGUAGE MODELS FOR BIOMEDICAL APPLICATIONS, VLM4BIO 2024, 2024, : 1 - 2
  • [9] The Neglected Tails in Vision-Language Models
    Parashar, Shubham
    Lin, Zhiqiu
    Liu, Tian
    Dong, Xiangjue
    Li, Yanan
    Ramanan, Deva
    Caverlee, James
    Kong, Shu
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 12988 - 12997