Foundation Models Defining a New Era in Vision: A Survey and Outlook

Cited by: 1
Authors
Awais, Muhammad [1, 2]
Naseer, Muzammal [3, 4, 5]
Khan, Salman [1, 5]
Anwer, Rao Muhammad [1]
Cholakkal, Hisham [1]
Shah, Mubarak [6]
Yang, Ming-Hsuan [7, 8, 9]
Khan, Fahad Shahbaz [1, 10]
Affiliations
[1] MBZ Univ AI, Abu Dhabi, U Arab Emirates
[2] Georgia Inst Technol, Comp Sci Dept, Atlanta, GA 30332 USA
[3] Khalifa Univ, Comp Sci Dept, Abu Dhabi, U Arab Emirates
[4] Khalifa Univ, Ctr Secure Cyber Phys Secur Syst, Abu Dhabi, U Arab Emirates
[5] Australian Natl Univ, CECS, Canberra, ACT 0200, Australia
[6] Univ Cent Florida, Ctr Res Comp Vis, Orlando, FL 32816 USA
[7] Univ Calif Merced, Merced, CA 95344 USA
[8] Yonsei Univ, Seoul 03722, South Korea
[9] Google Res, Mountain View, CA 94043 USA
[10] Linkoping Univ, Comp Vis Lab, S-58183 Linkoping, Sweden
Keywords
Adaptation models; Computational modeling; Foundation models; Data models; Surveys; Visualization; Reviews; Computer vision; Computer architecture; Context modeling; Contrastive learning; language and vision; large language models; masked modeling; self-supervised learning; LANGUAGE;
DOI
10.1109/TPAMI.2024.3506283
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, which is naturally governed by grammatical rules, and in other modalities such as audio and depth. Models trained to bridge the gap between such modalities, coupled with large-scale training data, facilitate contextual reasoning, generalization, and prompting capabilities at test time. These models are referred to as foundation models. The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, holding interactive dialogues by asking questions about an image or video scene, or manipulating a robot's behavior through language instructions. In this survey, we provide a comprehensive review of such emerging foundation models, including typical architecture designs to combine different modalities (vision, text, audio, etc.), training objectives (contrastive, generative), pre-training datasets, fine-tuning mechanisms, and the common prompting patterns: textual, visual, and heterogeneous. We discuss the open challenges and research directions for foundation models in computer vision, including difficulties in their evaluation and benchmarking, gaps in their real-world understanding, limitations of contextual understanding, biases, vulnerability to adversarial attacks, and interpretability issues. We review recent developments in this field, covering a wide range of applications of foundation models systematically and comprehensively.
Pages: 2245-2264
Page count: 20