VLP: A Survey on Vision-language Pre-training

Cited by: 53

Authors:

Chen, Fei-Long [1,2]

Zhang, Du-Zhen [1,3]

Han, Ming-Lun [1,3]

Chen, Xiu-Yi [1,3]

Shi, Jing [1]

Xu, Shuang [1]

Xu, Bo [1,2,3]

Affiliations:

[1] Chinese Acad Sci, Inst Automat, Beijing 100190, Peoples R China

[2] Univ Chinese Acad Sci, Sch Future Technol, Beijing 100049, Peoples R China

[3] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China

Source:

MACHINE INTELLIGENCE RESEARCH | 2023, Volume 20, Issue 01

Keywords:

Vision and language; pre-training; transformers; multimodal learning; representation learning

DOI:

10.1007/s11633-022-1369-5

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Subject classification code:

0812

Abstract:

In the past few years, the emergence of pre-training models has brought uni-modal fields such as computer vision (CV) and natural language processing (NLP) to a new era. Substantial works have shown that they are beneficial for downstream uni-modal tasks and avoid training a new model from scratch. So can such pre-trained models be applied to multi-modal tasks? Researchers have explored this problem and made significant progress. This paper surveys recent advances and new frontiers in vision-language pre-training (VLP), including image-text and video-text pre-training. To give readers a better overall grasp of VLP, we first review its recent advances in five aspects: feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks. Then, we summarize the specific VLP models in detail. Finally, we discuss the new frontiers in VLP. To the best of our knowledge, this is the first survey focused on VLP. We hope that this survey can shed light on future research in the VLP field.


Pages: 38 - 56

Number of pages: 19

50 items in total

[1] VLP: A Survey on Vision-language Pre-training
Fei-Long Chen
Du-Zhen Zhang
Ming-Lun Han
Xiu-Yi Chen
Jing Shi
Shuang Xu
Bo Xu
[J]. Machine Intelligence Research, 2023, 20 (01) : 38 - 56
[3] Survey on Vision-language Pre-training
Yin, Jiong
Zhang, Zhe-Dong
Gao, Yu-Han
Yang, Zhi-Wen
Li, Liang
Xiao, Mang
Sun, Yao-Qi
Yan, Cheng-Gang
[J]. Ruan Jian Xue Bao/Journal of Software, 2023, 34 (05) : 2000 - 2023
[4] VLP2MSA: Expanding vision-language pre-training to multimodal sentiment analysis
Yi, Guofeng
Fan, Cunhang
Zhu, Kang
Lv, Zhao
Liang, Shan
Wen, Zhengqi
Pei, Guanxiong
Li, Taihao
Tao, Jianhua
[J]. KNOWLEDGE-BASED SYSTEMS, 2024, 283
[5] Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
Jian, Yiren
Gao, Chongyang
Vosoughi, Soroush
[J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023
[6] Pre-training A Prompt Pool for Vision-Language Model
Liu, Jun
Gu, Yang
Yang, Zhaohua
Guo, Shuai
Liu, Huaqiu
Chen, Yiqiang
[J]. 2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023
[7] Vision-language pre-training via modal interaction
Cheng, Hang
Ye, Hehui
Zhou, Xiaofei
Liu, Ximeng
Chen, Fei
Wang, Meiqing
[J]. PATTERN RECOGNITION, 2024, 156
[8] Contrastive Vision-Language Pre-training with Limited Resources
Cui, Quan
Zhou, Boyan
Guo, Yu
Yin, Weidong
Wu, Hao
Yoshie, Osamu
Chen, Yubo
[J]. COMPUTER VISION, ECCV 2022, PT XXXVI, 2022, 13696 : 236 - 253
[9] Vision-Language Pre-Training with Triple Contrastive Learning
Yang, Jinyu
Duan, Jiali
Tran, Son
Xu, Yi
Chanda, Sampath
Chen, Liqun
Zeng, Belinda
Chilimbi, Trishul
Huang, Junzhou
[J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022 : 15650 - 15659
[10] Vision-Language Pre-Training for Boosting Scene Text Detectors
Song, Sibo
Wan, Jianqiang
Yang, Zhibo
Tang, Jun
Cheng, Wenqing
Bai, Xiang
Yao, Cong
[J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022 : 15660 - 15670
