Foundation Models Defining a New Era in Vision: A Survey and Outlook

Cited by: 1

Authors:

Awais, Muhammad ^{[1,2]}

Naseer, Muzammal ^{[3,4,5]}

Khan, Salman ^{[1,5]}

Anwer, Rao Muhammad ^{[1]}

Cholakkal, Hisham ^{[1]}

Shah, Mubarak ^{[6]}

Yang, Ming-Hsuan ^{[7,8,9]}

Khan, Fahad Shahbaz ^{[1,10]}

Affiliations:

[1] MBZ Univ AI, Abu Dhabi, U Arab Emirates

[2] Georgia Inst Technol, Comp Sci Dept, Atlanta, GA 30332 USA

[3] Khalifa Univ, Comp Sci Dept, Abu Dhabi, U Arab Emirates

[4] Khalifa Univ, Ctr Secure Cyber Phys Secur Syst, Abu Dhabi, U Arab Emirates

[5] Australian Natl Univ, CECS, Canberra, ACT 0200, Australia

[6] Univ Cent Florida, Ctr Res Comp Vis, Orlando, FL 32816 USA

[7] Univ Calif Merced, Merced, CA 95344 USA

[8] Yonsei Univ, Seoul 03722, South Korea

[9] Google Res, Mountain View, CA 94043 USA

[10] Linkoping Univ, Comp Vis Lab, S-58183 Linkoping, Sweden

Source:

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE | 2025 / Vol. 47 / No. 04

Keywords:

Adaptation models; Computational modeling; Foundation models; Data models; Surveys; Visualization; Reviews; Computer vision; Computer architecture; Context modeling; Contrastive learning; language and vision; large language models; masked modeling; self-supervised learning; LANGUAGE;

DOI:

10.1109/TPAMI.2024.3506283

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, naturally governed by grammatical rules and other modalities such as audio and depth. The models learned to bridge the gap between such modalities and large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time. These models are referred to as foundation models. The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene, or manipulating a robot's behavior through language instructions. In this survey, we provide a comprehensive review of such emerging foundation models, including typical architecture designs to combine different modalities (vision, text, audio, etc.), training objectives (contrastive, generative), pre-training datasets, fine-tuning mechanisms, and the common prompting patterns: textual, visual, and heterogeneous. We discuss the open challenges and research directions for foundation models in computer vision, including difficulties in their evaluation and benchmarking, gaps in their real-world understanding, limitations of contextual understanding, biases, vulnerability to adversarial attacks, and interpretability issues. We review recent developments in this field, covering a wide range of applications of foundation models systematically and comprehensively.

Pages: 2245 - 2264

Number of pages: 20
