Vision-Language Models for Vision Tasks: A Survey

Cited by: 8
Authors
Zhang, Jingyi [1 ]
Huang, Jiaxing [1 ]
Jin, Sheng [1 ]
Lu, Shijian [1 ]
Affiliations
[1] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore 639798, Singapore
Keywords
Task analysis; Visualization; Training; Deep learning; Surveys; Data models; Predictive models; Big data; big model; deep learning; deep neural network; knowledge distillation; object detection; pre-training; semantic segmentation; transfer learning; vision-language model; visual recognition; image classification
DOI
10.1109/TPAMI.2024.3369699
Chinese Library Classification (CLC): TP18 [Artificial Intelligence Theory]
Subject classification codes: 081104; 0812; 0835; 1405
Abstract
Most visual recognition studies rely heavily on crowd-labelled data for deep neural network (DNN) training, and they usually train a separate DNN for each visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address these two challenges, Vision-Language Models (VLMs) have been intensively investigated recently. VLMs learn rich vision-language correlations from web-scale image-text pairs that are almost infinitely available on the Internet, and they enable zero-shot predictions on various visual recognition tasks with a single model. This paper provides a systematic review of VLMs for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLMs, summarizing the widely adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely adopted datasets in VLM pre-training and evaluation; (4) the review and categorization of existing VLM pre-training, VLM transfer learning, and VLM knowledge distillation methods; (5) the benchmarking, analysis, and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in future VLM studies for visual recognition.
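The zero-shot prediction mechanism the abstract refers to can be sketched as follows: a pre-trained VLM embeds an image and a set of class-name prompts into a shared space, and the class whose text embedding is most similar to the image embedding becomes the prediction. The sketch below is illustrative only; it uses random stand-in vectors in place of a real VLM's image and text encoders, and the variable names are hypothetical, not from the surveyed paper.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so the dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
dim = 512  # embedding dimensionality (stand-in value)
class_names = ["cat", "dog", "car"]

# Stand-ins for a real VLM's text encoder applied to prompts like
# "a photo of a {class}", and its image encoder applied to one image.
text_embeddings = l2_normalize(rng.normal(size=(len(class_names), dim)))
image_embedding = l2_normalize(rng.normal(size=dim))

# CLIP-style zero-shot classification: cosine similarity between the
# image embedding and each class prompt embedding, then argmax.
logits = text_embeddings @ image_embedding
prediction = class_names[int(np.argmax(logits))]
print(prediction)
```

No training on the target task is involved: swapping in a different list of class names re-purposes the same model for a different recognition task, which is the "single VLM, many tasks" property the abstract highlights.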
Pages: 5625-5644
Page count: 20