Perceptual Grouping in Contrastive Vision-Language Models

Cited by: 10
Authors
Ranasinghe, Kanchana [1 ]
McKinzie, Brandon [1 ]
Ravi, Sachin [1 ]
Yang, Yinfei [1 ]
Toshev, Alexander [1 ]
Shlens, Jonathon [1 ]
Affiliations
[1] Apple, Cupertino, CA 95014 USA
DOI
10.1109/ICCV51070.2023.00513
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent advances in zero-shot image recognition suggest that vision-language models learn generic visual representations with a high degree of semantic information that may be arbitrarily probed with natural language phrases. Understanding an image, however, is not just about understanding what content resides within an image, but importantly, where that content resides. In this work we examine how well vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery. We demonstrate how contemporary vision and language representation learning models based on contrastive losses and large web-based data capture limited object localization information. We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information. We measure this performance in terms of zero-shot image recognition, unsupervised bottom-up and top-down semantic segmentations, as well as robustness analyses. We find that the resulting model achieves state-of-the-art results in terms of unsupervised segmentation, and demonstrate that the learned representations are uniquely robust to spurious correlations in datasets designed to probe the causal behavior of vision models.
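The abstract refers to models "based on contrastive losses and large web-based data". As context, the standard objective behind such models is a symmetric InfoNCE loss over matched image-text pairs (as in CLIP). The sketch below is a minimal, generic illustration of that loss in NumPy; it is not the paper's proposed modification, and the function name and temperature value are illustrative assumptions.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb, text_emb: (N, D) arrays; row i of each is a matched pair.
    Generic CLIP-style sketch, not the method proposed in the paper above.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix scaled by temperature; the diagonal
    # holds the positive (matched) pairs.
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy_diag(l):
        # Softmax cross-entropy where the target for row i is column i.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

With perfectly matched, mutually orthogonal embeddings the loss approaches zero; for random embeddings it stays positive, which is what training pushes down.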
Pages: 5548 - 5561
Number of pages: 14
Related Papers (50 total)
  • [41] DeAR: Debiasing Vision-Language Models with Additive Residuals
    Seth, Ashish
    Hemani, Mayur
    Agarwal, Chirag
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 6820 - 6829
  • [42] Learning Domain Invariant Prompt for Vision-Language Models
    Zhao, Cairong
    Wang, Yubin
    Jiang, Xinyang
    Shen, Yifei
    Song, Kaitao
    Li, Dongsheng
    Miao, Duoqian
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 1348 - 1360
  • [43] ECO: Ensembling Context Optimization for Vision-Language Models
    Agnolucci, Lorenzo
    Baldrati, Alberto
    Todino, Francesco
    Becattini, Federico
    Bertini, Marco
    Del Bimbo, Alberto
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 2803 - 2807
  • [44] Scaling Vision-Language Models with Sparse Mixture of Experts
    Shen, Sheng
    Yao, Zhewei
    Li, Chunyuan
    Darrell, Trevor
    Keutzer, Kurt
    He, Yuxiong
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 11329 - 11344
  • [45] DPO: Discrete Prompt Optimization for Vision-Language Models
    Liang, Nanhao
    Liu, Yong
    IEEE SIGNAL PROCESSING LETTERS, 2025, 32 : 671 - 675
  • [46] On Evaluating Adversarial Robustness of Large Vision-Language Models
    Zhao, Yunqing
    Pang, Tianyu
    Du, Chao
    Yang, Xiao
    Li, Chongxuan
    Cheung, Ngai-Man
    Lin, Min
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [47] Compositional Kronecker Context Optimization for vision-language models
    Ding, Kun
    Li, Xiaohui
    Yu, Qiang
    Wang, Ying
    Zhang, Haojian
    Xiang, Shiming
    NEUROCOMPUTING, 2024, 608
  • [48] Evaluating Object Hallucination in Large Vision-Language Models
    Li, Yifan
    Du, Yifan
    Zhou, Kun
    Wang, Jinpeng
    Zhao, Wayne Xin
    Wen, Ji-Rong
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 292 - 305
  • [49] Adapting vision-language AI models to cardiology tasks
    Arnaout, Rima
    NATURE MEDICINE, 2024, 30 (05) : 1245 - 1246
  • [50] BRAVE: Broadening the Visual Encoding of Vision-Language Models
    Kar, Oguzhan Fatih
    Tonioni, Alessio
    Poklukar, Petra
    Kulshrestha, Achin
    Zamir, Amir
    Tombari, Federico
    COMPUTER VISION - ECCV 2024, PT XVI, 2025, 15074 : 113 - 132