Perceptual Grouping in Contrastive Vision-Language Models

Cited by: 10

Authors:

Ranasinghe, Kanchana [1]

McKinzie, Brandon [1]

Ravi, Sachin [1]

Yang, Yinfei [1]

Toshev, Alexander [1]

Shlens, Jonathon [1]

Affiliations:

[1] Apple, Cupertino, CA 95014 USA

Source:

2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV | 2023

Keywords:

DOI:

10.1109/ICCV51070.2023.00513

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Recent advances in zero-shot image recognition suggest that vision-language models learn generic visual representations with a high degree of semantic information that may be arbitrarily probed with natural language phrases. Understanding an image, however, is not just about understanding what content resides within an image, but importantly, where that content resides. In this work we examine how well vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery. We demonstrate how contemporary vision and language representation learning models based on contrastive losses and large web-based data capture limited object localization information. We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information. We measure this performance in terms of zero-shot image recognition, unsupervised bottom-up and top-down semantic segmentations, as well as robustness analyses. We find that the resulting model achieves state-of-the-art results in terms of unsupervised segmentation, and demonstrate that the learned representations are uniquely robust to spurious correlations in datasets designed to probe the causal behavior of vision models.

Pages: 5548-5561

Page count: 14

Total: 50 entries

[21] VLCAP: VISION-LANGUAGE WITH CONTRASTIVE LEARNING FOR COHERENT VIDEO PARAGRAPH CAPTIONING
Yamazaki, Kashu
Truong, Sang
Vo, Khoa
Kidd, Michael
Rainwater, Chase
Luu, Khoa
Le, Ngan
2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 3656 - 3661
[22] Vision-Language Models for Robot Success Detection
Luo, Fiona
THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 21, 2024, : 23750 - 23752
[23] Exploring Vision-Language Models for Imbalanced Learning
Wang, Y.
Yu, Z.
Wang, J.
Heng, Q.
Chen, H.
Ye, W.
Xie, R.
Xie, X.
Zhang, S.
International Journal of Computer Vision, 2024, 132 (01) : 224 - 237
[24] Adversarial Prompt Tuning for Vision-Language Models
Zhang, Jiaming
Ma, Xingjun
Wang, Xin
Qiu, Lingyu
Wang, Jiaqi
Jiang, Yu-Gang
Sang, Jitao
COMPUTER VISION - ECCV 2024, PT XLV, 2025, 15103 : 56 - 72
[25] Task Residual for Tuning Vision-Language Models
Yu, Tao
Lu, Zhihe
Jin, Xin
Chen, Zhibo
Wang, Xinchao
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 10899 - 10909
[26] Adventures of Trustworthy Vision-Language Models: A Survey
Vatsa, Mayank
Jain, Anubhooti
Singh, Richa
THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 20, 2024, : 22650 - 22658
[27] Equivariant Similarity for Vision-Language Foundation Models
Wang, Tan
Lin, Kevin
Li, Linjie
Lin, Chung-Ching
Yang, Zhengyuan
Zhang, Hanwang
Liu, Zicheng
Wang, Lijuan
2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 11964 - 11974
[28] Towards Better Vision-Inspired Vision-Language Models
Cao, Yun-Hao
Ji, Kaixiang
Huang, Ziyuan
Zheng, Chuanyang
Liu, Jiajia
Wang, Jian
Chen, Jingdong
Yang, Ming
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 13537 - 13547
[29] Knowledge Boosting: Rethinking Medical Contrastive Vision-Language Pre-training
Chen, Xiaofei
He, Yuting
Xue, Cheng
Ge, Rongjun
Li, Shuo
Yang, Guanyu
MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT I, 2023, 14220 : 405 - 415
[30] Improving Medical Vision-Language Contrastive Pretraining With Semantics-Aware Triage
Liu, Bo
Lu, Donghuan
Wei, Dong
Wu, Xian
Wang, Yan
Zhang, Yu
Zheng, Yefeng
IEEE TRANSACTIONS ON MEDICAL IMAGING, 2023, 42 (12) : 3579 - 3589
