Position-guided Text Prompt for Vision-Language Pre-training

Cited by: 11
Authors
Wang, Jinpeng [2 ]
Zhou, Pan [1 ]
Shou, Mike Zheng [2 ]
Yan, Shuicheng [1 ]
Affiliations
[1] Sea AI Lab, Singapore, Singapore
[2] Natl Univ Singapore, Show Lab, Singapore, Singapore
Funding
National Research Foundation, Singapore
DOI
10.1109/CVPR52729.2023.02226
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Vision-Language Pre-Training (VLP) has shown promising capabilities in aligning image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization capability, which is critical for many downstream tasks such as visual reasoning. In this work, we propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with VLP. Specifically, in the VLP phase, PTP divides the image into N x N blocks and identifies the objects in each block using an object detector widely adopted in VLP. It then reformulates the visual grounding task into a fill-in-the-blank problem given a PTP, encouraging the model to predict the objects in given blocks or regress the blocks of a given object, e.g., filling "[P]" or "[O]" in a PTP "The block [P] has a [O]". This mechanism improves the visual grounding capability of VLP models and thus helps them better handle various downstream tasks. By introducing PTP into several state-of-the-art VLP frameworks, we observe consistently significant improvements across representative cross-modal learning model architectures and several benchmarks, e.g., zero-shot Flickr30K Retrieval (+4.8 in average recall@1) for the ViLT [16] baseline, and COCO Captioning (+5.3 in CIDEr) for the SOTA BLIP [19] baseline. Moreover, PTP achieves results comparable to object-detector-based methods [8, 23, 45] with much faster inference speed, since PTP discards the object detector at inference time while the latter cannot.
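To make the fill-in-the-blank formulation concrete, below is a minimal Python sketch of how position-guided prompts of the form "The block [P] has a [O]" could be generated from detector output. The detector output format, the row-major block numbering, and all helper names here are assumptions for illustration, not the authors' implementation.

from typing import List, Tuple

def block_index(cx: float, cy: float, width: int, height: int, n: int) -> int:
    """Map a box center to one of the N x N image blocks, numbered row-major from 0."""
    col = min(int(cx / width * n), n - 1)
    row = min(int(cy / height * n), n - 1)
    return row * n + col

def build_ptp_prompts(
    detections: List[Tuple[str, Tuple[float, float, float, float]]],
    width: int,
    height: int,
    n: int = 3,
) -> List[str]:
    """Turn (label, bbox) detections into PTP-style text prompts.
    Bboxes are assumed to be (x1, y1, x2, y2) in pixel coordinates."""
    prompts = []
    for label, (x1, y1, x2, y2) in detections:
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2  # object center
        p = block_index(cx, cy, width, height, n)
        prompts.append(f"The block {p} has a {label}")
    return prompts

# Example: a 600x400 image with two detected objects.
if __name__ == "__main__":
    dets = [("dog", (50, 300, 150, 390)), ("ball", (500, 50, 560, 110))]
    for prompt in build_ptp_prompts(dets, width=600, height=400):
        print(prompt)  # e.g., "The block 6 has a dog"

During pre-training, either the block index or the object word in such a prompt would be masked, so the model must predict the object in a given block or the block of a given object, as described in the abstract.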
Pages: 23242-23251
Page count: 10