Position-guided Text Prompt for Vision-Language Pre-training

Cited by: 11

Authors:

Wang, Jinpeng [2]

Zhou, Pan [1]

Shou, Mike Zheng [2]

Yan, Shuicheng [1]

Affiliations:

[1] Sea AI Lab, Singapore, Singapore

[2] Natl Univ Singapore, Show Lab, Singapore, Singapore

Source:

2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023

Funding:

National Research Foundation, Singapore

DOI:

10.1109/CVPR52729.2023.02226

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Vision-Language Pre-Training (VLP) has shown promising capabilities in aligning image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization capability that is critical for many downstream tasks such as visual reasoning. In this work, we propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with VLP. Specifically, in the VLP phase, PTP divides the image into N × N blocks and identifies the objects in each block using an object detector widely used in VLP. It then reformulates the visual grounding task as a fill-in-the-blank problem given a PTP, encouraging the model to predict the objects in the given blocks or regress the blocks of a given object, e.g., filling "[P]" or "[O]" in a PTP "The block [P] has a [O]". This mechanism improves the visual grounding capability of VLP models and thus helps them better handle various downstream tasks. By introducing PTP into several state-of-the-art VLP frameworks, we observe consistently significant improvements across representative cross-modal learning model architectures and several benchmarks, e.g., zero-shot Flickr30K Retrieval (+4.8 in average recall@1) for the ViLT [16] baseline, and COCO Captioning (+5.3 in CIDEr) for the SOTA BLIP [19] baseline. Moreover, PTP achieves results comparable to object-detector-based methods [8, 23, 45] with much faster inference, since PTP discards its object detector at inference time while the latter cannot.
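
The prompt construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how an object is assigned to a block or how blocks are numbered, so the sketch assumes assignment by bounding-box center and row-major block indices starting at 0. The names `block_index` and `build_prompts` are hypothetical, and the detections are made-up inputs standing in for the output of an object detector.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def block_index(box: Box, width: int, height: int, n: int) -> int:
    """Row-major index of the N x N grid block that contains the box center.

    Center-based assignment and row-major numbering are assumptions of
    this sketch; the abstract does not specify either.
    """
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    # Clamp so that centers on the image border still map to a valid block.
    col = min(max(int(cx * n / width), 0), n - 1)
    row = min(max(int(cy * n / height), 0), n - 1)
    return row * n + col


def build_prompts(
    detections: List[Tuple[str, Box]], width: int, height: int, n: int = 3
) -> List[Tuple[str, str, str]]:
    """For each detected object, return (full prompt, [P]-masked, [O]-masked)."""
    prompts = []
    for label, box in detections:
        p = block_index(box, width, height, n)
        full = f"The block {p} has a {label}"
        mask_p = f"The block [P] has a {label}"  # model predicts the block
        mask_o = f"The block {p} has a [O]"      # model predicts the object
        prompts.append((full, mask_p, mask_o))
    return prompts


# Hypothetical detections on a 300 x 300 image with a 3 x 3 grid.
detections = [("dog", (10, 20, 110, 120)), ("ball", (250, 250, 290, 290))]
prompts = build_prompts(detections, width=300, height=300, n=3)
for full, mask_p, mask_o in prompts:
    print(full, "|", mask_p, "|", mask_o)
```

In this sketch the detector is needed only to build the prompts, which mirrors the abstract's point that PTP uses the object detector during pre-training and discards it for inference.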

Citation

Pages: 23242-23251

Page count: 10

