VLP: A Survey on Vision-language Pre-training

Cited by: 53
Authors
Chen, Fei-Long [1, 2]
Zhang, Du-Zhen [1, 3]
Han, Ming-Lun [1, 3]
Chen, Xiu-Yi [1, 3]
Shi, Jing [1]
Xu, Shuang [1]
Xu, Bo [1, 2, 3]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Sch Future Technol, Beijing 100049, Peoples R China
[3] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China
Keywords
Vision and language; pre-training; transformers; multimodal learning; representation learning
DOI
10.1007/s11633-022-1369-5
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
In the past few years, the emergence of pre-training models has brought uni-modal fields such as computer vision (CV) and natural language processing (NLP) to a new era. Substantial works have shown that they are beneficial for downstream uni-modal tasks and avoid training a new model from scratch. So can such pre-trained models be applied to multi-modal tasks? Researchers have explored this problem and made significant progress. This paper surveys recent advances and new frontiers in vision-language pre-training (VLP), including image-text and video-text pre-training. To give readers a better overall grasp of VLP, we first review its recent advances in five aspects: feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks. Then, we summarize the specific VLP models in detail. Finally, we discuss the new frontiers in VLP. To the best of our knowledge, this is the first survey focused on VLP. We hope that this survey can shed light on future research in the VLP field.
Pages: 38-56
Number of pages: 19
Related papers
50 in total
  • [21] Knowledge Boosting: Rethinking Medical Contrastive Vision-Language Pre-training
    Chen, Xiaofei
    He, Yuting
    Xue, Cheng
    Ge, Rongjun
    Li, Shuo
    Yang, Guanyu
    [J]. MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT I, 2023, 14220: 405-415
  • [22] Vision-Language Pre-Training: Basics, Recent Advances, and Future Trends
    Gan, Zhe
    Li, Linjie
    Li, Chunyuan
    Wang, Lijuan
    Liu, Zicheng
    Gao, Jianfeng
    [J]. FOUNDATIONS AND TRENDS IN COMPUTER GRAPHICS AND VISION, 2022, 14 (3-4): 163-352
  • [23] ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation
    Wang, Weihan
    Yang, Zhen
    Xu, Bin
    Li, Juanzi
    Sun, Yankui
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023: 3135-3146
  • [24] Kaleido-BERT: Vision-Language Pre-training on Fashion Domain
    Zhuge, Mingchen
    Gao, Dehong
    Fan, Deng-Ping
    Jin, Linbo
    Chen, Ben
    Zhou, Haoming
    Qiu, Minghui
    Shao, Ling
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021: 12642-12652
  • [25] Subsampling of Frequent Words in Text for Pre-training a Vision-Language Model
    Liang, Mingliang
    Larson, Martha
    [J]. PROCEEDINGS OF THE 1ST WORKSHOP ON LARGE GENERATIVE MODELS MEET MULTIMODAL APPLICATIONS, LGM3A 2023, 2023: 61-67
  • [26] EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
    Mu, Yao
    Zhang, Qinglong
    Hu, Mengkang
    Wang, Wenhai
    Ding, Mingyu
    Jin, Jun
    Wang, Bin
    Dai, Jifeng
    Qiao, Yu
    Luo, Ping
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023
  • [27] Source-Free Domain Adaptation Guided by Vision and Vision-Language Pre-training
    Zhang, Wenyu
    Shen, Li
    Foo, Chuan-Sheng
    [J]. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024
  • [28] MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model
    Ji, Yatai
    Wang, Junjie
    Gong, Yuan
    Zhang, Lin
    Zhu, Yanru
    Wang, Hongfa
    Zhang, Jiaxing
    Sakai, Tetsuya
    Yang, Yujiu
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023: 23262-23271
  • [29] Automated Bridge Inspection Image Interpretation Based on Vision-Language Pre-Training
    Wang, Shengyi
    El-Gohary, Nora
    [J]. COMPUTING IN CIVIL ENGINEERING 2023-DATA, SENSING, AND ANALYTICS, 2024: 1-8
  • [30] Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner
    Liu, Zikang
    Chen, Sihan
    Guo, Longteng
    Li, Handong
    He, Xingjian
    Liu, Jing
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023: 5120-5131