Large-scale Multi-modal Pre-trained Models: A Comprehensive Survey

Cited by: 0
Authors
Xiao Wang
Guangyao Chen
Guangwu Qian
Pengcheng Gao
Xiao-Yong Wei
Yaowei Wang
Yonghong Tian
Wen Gao
Affiliations
[1] Peng Cheng Laboratory
[2] School of Computer Science and Technology, Anhui University
[3] School of Computer Science, Peking University
[4] College of Computer Science, Sichuan University
Source
Machine Intelligence Research, 2023, 20(4)
Keywords
Multi-modal (MM); pre-trained model (PTM); information fusion; representation learning; deep learning
Abstract
With the urgent demand for generalized deep models, many large pre-trained models have been proposed, such as bidirectional encoder representations from transformers (BERT), the vision transformer (ViT), and generative pre-trained transformers (GPT). Inspired by the success of these models in single domains (such as computer vision and natural language processing), multi-modal pre-trained big models have drawn increasing attention in recent years. In this work, we give a comprehensive survey of these models and hope this paper provides new insights and helps new researchers track the most cutting-edge work. Specifically, we first introduce the background of multi-modal pre-training by reviewing conventional deep learning and pre-training work in natural language processing, computer vision, and speech. Then, we introduce the task definition, key challenges, and advantages of multi-modal pre-trained models (MM-PTMs), and discuss MM-PTMs with a focus on data, objectives, network architectures, and knowledge-enhanced pre-training. After that, we introduce the downstream tasks used to validate large-scale MM-PTMs, including generative, classification, and regression tasks. We also visualize and analyze the model parameters and results on representative downstream tasks. Finally, we point out possible research directions that may benefit future work. In addition, we maintain a continuously updated paper list for large-scale pre-trained multi-modal big models: https://github.com/wangxiao5791509/MultiModal_BigModels_Survey.
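The abstract organizes MM-PTMs by their pre-training data, objectives, and network architectures. As a concrete illustration of one widely used multi-modal pre-training objective (not a method proposed in this survey), the sketch below shows a CLIP-style contrastive image-text alignment loss in PyTorch; the feature dimension, batch size, and temperature value are assumptions chosen for demonstration only.

```python
# Minimal sketch of a contrastive image-text alignment objective (CLIP-style).
# Illustrative only: encoder choices, dimensions, and temperature are assumed,
# not taken from the surveyed models.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(image_feats: torch.Tensor,
                               text_feats: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_feats, text_feats: [batch, dim] outputs of the two uni-modal encoders.
    """
    # L2-normalize so dot products equal cosine similarities.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = image_feats @ text_feats.t() / temperature

    # Matching pairs lie on the diagonal; treat alignment as classification.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    # Toy usage with random features standing in for encoder outputs.
    imgs = torch.randn(8, 512)
    txts = torch.randn(8, 512)
    print(contrastive_alignment_loss(imgs, txts).item())
```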
Pages: 447-482 (35 pages)
Related papers (50 in total)
  • [1] Wang, Xiao; Chen, Guangyao; Qian, Guangwu; Gao, Pengcheng; Wei, Xiao-Yong; Wang, Yaowei; Tian, Yonghong; Gao, Wen. Large-scale Multi-modal Pre-trained Models: A Comprehensive Survey. Machine Intelligence Research, 2023, 20(4): 447-482.
  • [2] Bellagente, Marco; Brack, Manuel; Teufel, Hannah; Friedrich, Felix; Deiseroth, Bjoern; Eichenberg, Constantin; Dai, Andrew; Baldock, Robert J. N.; Nanda, Souradeep; Oostermeijer, Koen; Cruz-Salinas, Andres Felipe; Schramowski, Patrick; Kersting, Kristian; Weinbach, Samuel. MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation. Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023.
  • [3] Wang, Meng; Wang, Haofen; Qi, Guilin; Zheng, Qiushuo. Richpedia: A Large-Scale, Comprehensive Multi-Modal Knowledge Graph. Big Data Research, 2020, 22.
  • [4] Miao, Yongzhu; Li, Shasha; Tang, Jintao; Wang, Ting. MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models. 2023 IEEE International Conference on Multimedia and Expo (ICME), 2023: 25-30.
  • [5] Kong, Yawei; Fan, Kai. Probing Multi-modal Machine Translation with Pre-trained Language Model. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021: 3689-3699.
  • [6] Zhang, Yu; Fu, Zilong; Huang, Fuyu; Liu, Yizhi. PMMN: Pre-trained Multi-modal Network for Scene Text Recognition. Pattern Recognition Letters, 2021, 151: 103-111.
  • [7] Iyer, Vasanth; Aved, Alexander; Howlett, Todd B.; Carlo, Jeffrey T.; Mehmood, Asif; Pissinou, Niki; Iyengar, S. S. Fast Multi-Modal Reuse: Co-Occurrence Pre-Trained Deep Learning Models. Proceedings of SPIE - The International Society for Optical Engineering, 2019, 10996.
  • [8] Iyer, Vasanth; Aved, Alexander; Howlett, Todd B.; Carlo, Jeffrey T.; Mehmood, Asif; Pissinou, Niki; Iyengar, S. S. Fast Multi-Modal Reuse: Co-Occurrence Pre-Trained Deep Learning Models. Real-Time Image Processing and Deep Learning 2019, 2019, 10996.
  • [9] Sun, Yuchong; Cheng, Xiwei; Song, Ruihua; Che, Wanxiang; Lu, Zhiwu; Wen, Jirong. Difference between Multi-modal vs. Text Pre-trained Models in Embedding Text. Beijing Daxue Xuebao (Ziran Kexue Ban)/Acta Scientiarum Naturalium Universitatis Pekinensis, 2023, 59(1): 48-56.
  • [10] Tan, Zhentao; Wu, Yue; Liu, Qiankun; Chu, Qi; Lu, Le; Ye, Jieping; Yu, Nenghai. Exploring the Application of Large-Scale Pre-Trained Models on Adverse Weather Removal. IEEE Transactions on Image Processing, 2024, 33: 1683-1698.