Vision-Language Models for Vision Tasks: A Survey

Cited by: 8
Authors
Zhang, Jingyi [1 ]
Huang, Jiaxing [1 ]
Jin, Sheng [1 ]
Lu, Shijian [1 ]
Affiliations
[1] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore 639798, Singapore
Keywords
Task analysis; Visualization; Training; Deep learning; Surveys; Data models; Predictive models; Big Data; big model; deep learning; deep neural network; knowledge distillation; object detection; pre-training; semantic segmentation; transfer learning; vision-language model; visual recognition; image classification
DOI
10.1109/TPAMI.2024.3369699
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Most visual recognition studies rely heavily on crowd-labelled data for training deep neural networks (DNNs), and they usually train a separate DNN for each visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address these two challenges, Vision-Language Models (VLMs) have been intensively investigated recently; they learn rich vision-language correlations from web-scale image-text pairs that are almost infinitely available on the Internet, and enable zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of vision-language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLMs, summarizing the widely adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely adopted datasets in VLM pre-training and evaluation; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis, and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in future VLM studies for visual recognition.
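The zero-shot prediction mechanism the abstract describes can be sketched in a few lines: a CLIP-style VLM encodes an image and a set of class-name prompts into a shared embedding space, and the class whose text embedding is most cosine-similar to the image embedding wins. The sketch below uses toy NumPy vectors in place of real encoder outputs (the embeddings, prompt strings, and `temperature` value are illustrative assumptions, not values from the survey).

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """Pick the class whose text embedding is most cosine-similar
    to the image embedding (CLIP-style zero-shot classification)."""
    # L2-normalize so dot products become cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = text_embs @ image_emb / temperature  # one logit per class prompt
    probs = np.exp(logits - logits.max())         # stable softmax
    probs /= probs.sum()
    return int(np.argmax(probs)), probs

# Toy 4-d embeddings standing in for real image/text encoder outputs.
classes = ["a photo of a cat", "a photo of a dog"]
text_embs = np.array([[1.0, 0.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0, 0.0]])
image_emb = np.array([0.9, 0.1, 0.0, 0.0])  # closest to the "cat" prompt

idx, probs = zero_shot_classify(image_emb, text_embs)
print(classes[idx])
```

Because the class set lives entirely in the text prompts, the same frozen model can be pointed at new recognition tasks simply by swapping the prompt list, which is why a single VLM suffices across tasks.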
Pages: 5625-5644
Page count: 20