VLP: A Survey on Vision-language Pre-training

Cited by: 53

Authors:

Chen, Fei-Long [1,2]

Zhang, Du-Zhen [1,3]

Han, Ming-Lun [1,3]

Chen, Xiu-Yi [1,3]

Shi, Jing [1]

Xu, Shuang [1]

Xu, Bo [1,2,3]

Affiliations:

[1] Chinese Acad Sci, Inst Automat, Beijing 100190, Peoples R China

[2] Univ Chinese Acad Sci, Sch Future Technol, Beijing 100049, Peoples R China

[3] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China

Source:

MACHINE INTELLIGENCE RESEARCH | 2023 / Vol. 20 / No. 01

Keywords:

Vision and language; pre-training; transformers; multimodal learning; representation learning;

DOI:

10.1007/s11633-022-1369-5

Chinese Library Classification number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

In the past few years, the emergence of pre-training models has brought uni-modal fields such as computer vision (CV) and natural language processing (NLP) to a new era. Substantial works have shown that they are beneficial for downstream uni-modal tasks and avoid training a new model from scratch. So can such pre-trained models be applied to multi-modal tasks? Researchers have explored this problem and made significant progress. This paper surveys recent advances and new frontiers in vision-language pre-training (VLP), including image-text and video-text pre-training. To give readers a better overall grasp of VLP, we first review its recent advances in five aspects: feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks. Then, we summarize the specific VLP models in detail. Finally, we discuss the new frontiers in VLP. To the best of our knowledge, this is the first survey focused on VLP. We hope that this survey can shed light on future research in the VLP field.


Pages: 38-56

Page count: 19


