I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification

Cited by: 4
Authors
Naeem, Muhammad Ferjad [1]
Khan, Muhammad Gul Zain Ali [2,3]
Xian, Yongqin [5]
Afzal, Muhammad Zeshan [2,3]
Stricker, Didier [2,3]
Van Gool, Luc [1]
Tombari, Federico [4,5]
Affiliations
[1] Swiss Fed Inst Technol, Zurich, Switzerland
[2] TUKL, Kaiserslautern, Germany
[3] DFKI, Kaiserslautern, Germany
[4] TUM, Munich, Germany
[5] Google, Hamburg, Germany
DOI
10.1109/CVPR52729.2023.01456
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent works have shown that unstructured text (documents) from online sources can serve as useful auxiliary information for zero-shot image classification. However, these methods require access to a high-quality source like Wikipedia and are limited to a single source of information. Large Language Models (LLMs) trained on web-scale text show impressive abilities to repurpose their learned knowledge for a multitude of tasks. In this work, we provide a novel perspective on using an LLM to provide text supervision for a zero-shot image classification model. The LLM is given a few text descriptions from different annotators as examples and, conditioned on these examples, generates multiple text descriptions for each class (referred to as views). Our proposed model, I2MVFormer, learns multi-view semantic embeddings for zero-shot image classification with these class views. We show that each text view of a class provides complementary information, allowing the model to learn a highly discriminative class embedding. Moreover, we show that I2MVFormer is better at consuming the multi-view text supervision from the LLM than baseline models. I2MVFormer establishes a new state of the art on three public benchmark datasets for zero-shot image classification with unsupervised semantic embeddings. Code available at https://github.com/ferjad/I2DFormer
Pages: 15169 - 15179
Page count: 11
Related Papers
7 items
  • [1] Multi-view enhanced zero-shot node classification
    Wang, Jiahui
    Wu, Likang
    Zhao, Hongke
    Jia, Ning
    INFORMATION PROCESSING & MANAGEMENT, 2023, 60 (06)
  • [2] I2DFormer+: Learning Image to Document Summary Attention for Zero-Shot Image Classification
    Naeem, Muhammad Ferjad
    Xian, Yongqin
    Van Gool, Luc
    Tombari, Federico
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024, 132 (09) : 3806 - 3822
  • [3] Saliency-based Multi-View Mixed Language Training for Zero-shot Cross-lingual Classification
    Lai, Siyu
    Huang, Hui
    Jing, Dong
    Chen, Yufeng
    Xu, Jinan
    Liu, Jian
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2021, 2021, : 599 - 610
  • [4] DILF: Differentiable rendering-based multi-view Image-Language Fusion for zero-shot 3D shape understanding
    Ning, Xin
    Yu, Zaiyang
    Li, Lusi
    Li, Weijun
    Tiwari, Prayag
    INFORMATION FUSION, 2024, 102
  • [5] CDZL: a controllable diversity zero-shot image caption model using large language models
    Zhao, Xin
    Kong, Weiwei
    Liu, Zongyao
    Wang, Menghao
    Li, Yiwen
    Signal, Image and Video Processing, 2025, 19 (4)
  • [6] CGUN-2A: Deep Graph Convolutional Network via Contrastive Learning for Large-Scale Zero-Shot Image Classification
    Li, Liangwei
    Liu, Lin
    Du, Xiaohui
    Wang, Xiangzhou
    Zhang, Ziruo
    Zhang, Jing
    Zhang, Ping
    Liu, Juanxiu
    SENSORS, 2022, 22 (24)
  • [7] A comparative study of large language model-based zero-shot inference and task-specific supervised classification of breast cancer pathology reports
    Sushil, Madhumita
    Zack, Travis
    Mandair, Divneet
    Zheng, Zhiwei
    Wali, Ahmed
    Yu, Yan-Ning
    Quan, Yuwei
    Lituiev, Dmytro
    Butte, Atul J.
    JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2024, 31 (10) : 2315 - 2327