Parts of Speech-Grounded Subspaces in Vision-Language Models

Cited by: 0
Authors:
Oldfield, James [1 ]
Tzelepis, Christos [1 ]
Panagakis, Yannis [2 ,3 ]
Nicolaou, Mihalis A. [4 ]
Patras, Ioannis [1 ]
Affiliations:
[1] Queen Mary Univ London, London, England
[2] Natl & Kapodistrian Univ Athens, Athens, Greece
[3] Archimedes Athena RC, Maroussi, Greece
[4] Cyprus Inst, Aglandjia, Cyprus
Funding: European Union Horizon 2020
DOI: not available
CLC Number: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract:
Latent image representations arising from vision-language models have proved immensely useful for a variety of downstream tasks. However, their utility is limited by their entanglement with respect to different visual attributes. For instance, recent work has shown that CLIP image representations are often biased towards specific visual properties (such as objects or actions) in an unpredictable manner. In this paper, we propose to separate representations of the different visual modalities in CLIP's joint vision-language space by leveraging the association between parts of speech and specific visual modes of variation (e.g. nouns relate to objects, adjectives describe appearance). This is achieved by formulating an appropriate component analysis model that learns subspaces capturing the variability corresponding to a specific part of speech, while jointly minimising the variability corresponding to the rest. Such a subspace yields disentangled representations of the different visual properties of an image or text in closed form, while respecting the underlying geometry of the manifold on which the representations lie. Moreover, we show that the proposed model also facilitates learning subspaces corresponding to specific visual appearances (e.g. artists' painting styles), which enables the selective removal of entire visual themes from CLIP-based text-to-image synthesis. We validate the model both qualitatively, by visualising the subspace projections with a text-to-image model and by preventing the imitation of artists' styles, and quantitatively, through class-invariance metrics and improvements to baseline zero-shot classification.
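The component-analysis formulation described in the abstract — learning a subspace that captures the variability of one part of speech (PoS) while suppressing that of the rest — can be realised in closed form as a generalised eigenproblem. The sketch below illustrates that general idea only; the function names (pos_subspace, project), the inputs X_target/X_rest, and the particular epsilon-regularised objective are assumptions made for illustration, not the authors' exact formulation.

```python
import numpy as np
from scipy.linalg import eigh

def pos_subspace(X_target, X_rest, k, eps=1e-6):
    """Closed-form subspace for one part of speech (illustrative sketch).

    X_target: (n, d) CLIP text embeddings whose captions vary only in
              the target PoS (e.g. adjectives) -- hypothetical input.
    X_rest:   (m, d) embeddings varying in the remaining parts of speech.
    Returns an orthonormal basis W of shape (d, k).
    """
    # Covariances of the two mean-centred embedding sets.
    S_t = np.cov(X_target, rowvar=False)
    S_r = np.cov(X_rest, rowvar=False)

    # Maximise tr(W^T S_t W) while minimising tr(W^T S_r W), solved in
    # closed form as the regularised generalised eigenproblem
    # S_t v = lambda (S_r + eps I) v.
    d = S_t.shape[0]
    _, V = eigh(S_t, S_r + eps * np.eye(d))

    # eigh returns eigenvalues in ascending order: take the top-k
    # generalised eigenvectors and re-orthonormalise them.
    W, _ = np.linalg.qr(V[:, -k:])
    return W

def project(z, W):
    # Keep only the target-PoS component, then map back to the unit
    # sphere, since CLIP embeddings are compared by cosine similarity.
    p = z @ W @ W.T
    return p / np.linalg.norm(p, axis=-1, keepdims=True)
```

As a usage example under the same assumptions, X_target could hold CLIP embeddings of captions that differ only in their adjectives, with X_rest built from captions varying in nouns and verbs; projecting any embedding with project(z, W) would then retain, approximately, only appearance-related variation. Re-normalising after projection is one simple way to keep the result on the unit sphere, standing in for the paper's more careful treatment of the geometry of the representation manifold.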
Pages: 25
Related papers (50 in total):
  • [21] ECO: Ensembling Context Optimization for Vision-Language Models. Agnolucci, Lorenzo; Baldrati, Alberto; Todino, Francesco; Becattini, Federico; Bertini, Marco; Del Bimbo, Alberto. 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2023: 2803-2807.
  • [22] DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention. Liu, Fenglin; Wu, Xian; Ge, Shen; Ren, Xuancheng; Fan, Wei; Sun, Xu; Zou, Yuexian. ACM Transactions on Knowledge Discovery from Data, 2022, 16 (1).
  • [23] Towards an Exhaustive Evaluation of Vision-Language Foundation Models. Salin, Emmanuelle; Ayache, Stephane; Favre, Benoit. 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2023: 339-352.
  • [24] Effectiveness assessment of recent large vision-language models. Jiang, Yao; Yan, Xinyu; Ji, Ge-Peng; Fu, Keren; Sun, Meijun; Xiong, Huan; Fan, Deng-Ping; Khan, Fahad Shahbaz. Visual Intelligence, 2 (1).
  • [25] On Evaluating Adversarial Robustness of Large Vision-Language Models. Zhao, Yunqing; Pang, Tianyu; Du, Chao; Yang, Xiao; Li, Chongxuan; Cheung, Ngai-Man; Lin, Min. Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023.
  • [26] Compositional Kronecker Context Optimization for vision-language models. Ding, Kun; Li, Xiaohui; Yu, Qiang; Wang, Ying; Zhang, Haojian; Xiang, Shiming. Neurocomputing, 2024, 608.
  • [27] Adapting vision-language AI models to cardiology tasks. Arnaout, Rima. Nature Medicine, 2024.
  • [28] Improving Medical Speech-to-Text Accuracy using Vision-Language Pre-training Models. Huh, Jaeyoung; Park, Sangjoon; Lee, Jeong Eun; Ye, Jong Chul. IEEE Journal of Biomedical and Health Informatics, 2024, 28 (3): 1692-1703.
  • [29] UNIMO-2: End-to-End Unified Vision-Language Grounded Learning. Li, Wei; Gao, Can; Niu, Guocheng; Xiao, Xinyan; Liu, Hao; Liu, Jiachen; Wu, Hua; Wang, Haifeng. Findings of the Association for Computational Linguistics (ACL 2022), 2022: 3187-3201.
  • [30] GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph. Li, Xin; Lian, Dongze; Lu, Zhihe; Bai, Jiawang; Chen, Zhibo; Wang, Xinchao. Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023.