Vision-Language Models for Vision Tasks: A Survey

Cited by: 8

Authors:

Zhang, Jingyi [1]

Huang, Jiaxing [1]

Jin, Sheng [1]

Lu, Shijian [1]

Affiliations:

[1] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore 639798, Singapore

Source:

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE | 2024 / Vol. 46 / No. 08

Keywords:

Task analysis; Visualization; Training; Deep learning; Surveys; Data models; Predictive models; Big Data; big model; deep learning; deep neural network; knowledge distillation; object detection; pre-training; semantic segmentation; transfer learning; vision-language model; visual recognition; image classification; CLASSIFICATION;

DOI:

10.1109/TPAMI.2024.3369699

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Most visual recognition studies rely heavily on crowd-labelled data for deep neural network (DNN) training, and they usually train a DNN for each single visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address these two challenges, Vision-Language Models (VLMs) have been intensively investigated recently; they learn rich vision-language correlations from web-scale image-text pairs that are almost infinitely available on the Internet and enable zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of vision-language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLMs that summarize the widely adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely adopted datasets in VLM pre-training and evaluations; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in future VLM studies for visual recognition.

Pages: 5625-5644

Page count: 20
