Foundation Models Defining a New Era in Vision: A Survey and Outlook

Cited by: 1
Authors
Awais, Muhammad [1, 2]
Naseer, Muzammal [3, 4, 5]
Khan, Salman [1, 5]
Anwer, Rao Muhammad [1]
Cholakkal, Hisham [1]
Shah, Mubarak [6]
Yang, Ming-Hsuan [7, 8, 9]
Khan, Fahad Shahbaz [1, 10]
Affiliations
[1] MBZ Univ AI, Abu Dhabi, U Arab Emirates
[2] Georgia Inst Technol, Comp Sci Dept, Atlanta, GA 30332 USA
[3] Khalifa Univ, Comp Sci Dept, Abu Dhabi, U Arab Emirates
[4] Khalifa Univ, Ctr Secure Cyber Phys Secur Syst, Abu Dhabi, U Arab Emirates
[5] Australian Natl Univ, CECS, Canberra, ACT 0200, Australia
[6] Univ Cent Florida, Ctr Res Comp Vis, Orlando, FL 32816 USA
[7] Univ Calif Merced, Merced, CA 95344 USA
[8] Yonsei Univ, Seoul 03722, South Korea
[9] Google Res, Mountain View, CA 94043 USA
[10] Linkoping Univ, Comp Vis Lab, S-58183 Linkoping, Sweden
Keywords
Adaptation models; Computational modeling; Foundation models; Data models; Surveys; Visualization; Reviews; Computer vision; Computer architecture; Context modeling; Contrastive learning; language and vision; large language models; masked modeling; self-supervised learning; LANGUAGE
DOI
10.1109/TPAMI.2024.3506283
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, naturally governed by grammatical rules and other modalities such as audio and depth. The models learned to bridge the gap between such modalities and large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time. These models are referred to as foundation models. The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene, or manipulating a robot's behavior through language instructions. In this survey, we provide a comprehensive review of such emerging foundation models, including typical architecture designs to combine different modalities (vision, text, audio, etc.), training objectives (contrastive, generative), pre-training datasets, fine-tuning mechanisms, and the common prompting patterns: textual, visual, and heterogeneous. We discuss the open challenges and research directions for foundation models in computer vision, including difficulties in their evaluation and benchmarking, gaps in their real-world understanding, limitations of contextual understanding, biases, vulnerability to adversarial attacks, and interpretability issues. We review recent developments in this field, covering a wide range of applications of foundation models systematically and comprehensively.
Pages: 2245-2264
Page count: 20