Vision-Language Models for Vision Tasks: A Survey

Cited by: 8
Authors
Zhang, Jingyi [1 ]
Huang, Jiaxing [1 ]
Jin, Sheng [1 ]
Lu, Shijian [1 ]
Affiliations
[1] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore 639798, Singapore
Keywords
Task analysis; Visualization; Training; Deep learning; Surveys; Data models; Predictive models; Big Data; big model; deep learning; deep neural network; knowledge distillation; object detection; pre-training; semantic segmentation; transfer learning; vision-language model; visual recognition; image classification
DOI
10.1109/TPAMI.2024.3369699
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Most visual recognition studies rely heavily on crowd-labelled data for training deep neural networks (DNNs), and they usually train a separate DNN for each visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address these two challenges, Vision-Language Models (VLMs) have been intensively investigated recently; they learn rich vision-language correlations from web-scale image-text pairs that are almost infinitely available on the Internet, and enable zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of vision-language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLMs, summarizing the widely adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely adopted datasets in VLM pre-training and evaluation; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis, and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in future VLM studies for visual recognition.
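The zero-shot prediction mechanism the abstract describes can be sketched in a few lines: a CLIP-style VLM encodes an image and a set of class-name prompts into a shared embedding space, and the class whose text embedding is most cosine-similar to the image embedding wins. The sketch below uses toy NumPy vectors in place of real encoder outputs (the embeddings, prompt strings, and `temperature` value are illustrative assumptions, not values from the survey).

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """Pick the class whose text embedding is most cosine-similar
    to the image embedding (CLIP-style zero-shot classification)."""
    # L2-normalize so dot products become cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = text_embs @ image_emb / temperature  # one logit per class prompt
    probs = np.exp(logits - logits.max())         # stable softmax
    probs /= probs.sum()
    return int(np.argmax(probs)), probs

# Toy 4-d embeddings standing in for real image/text encoder outputs.
classes = ["a photo of a cat", "a photo of a dog"]
text_embs = np.array([[1.0, 0.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0, 0.0]])
image_emb = np.array([0.9, 0.1, 0.0, 0.0])  # closest to the "cat" prompt

idx, probs = zero_shot_classify(image_emb, text_embs)
print(classes[idx])
```

Because the class set lives entirely in the text prompts, the same frozen model can be pointed at new recognition tasks simply by swapping the prompt list, which is why a single VLM suffices across tasks.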
Pages: 5625-5644
Page count: 20