COOKIE: Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation

Cited by: 21

Authors:

Wen, Keyu [1]

Xia, Jin [1]

Huang, Yuanyuan [1]

Li, Linyang [2]

Xu, Jiayan [1]

Shao, Jie [1]

Affiliations:

[1] ByteDance AI Lab, London, England

[2] Fudan University, Shanghai, People's Republic of China

Source:

2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) | 2021

Keywords:

DOI:

10.1109/ICCV48922.2021.00221

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

There has been a recent surge of interest in cross-modal pre-training. However, existing approaches pre-train a one-stream model to learn a joint vision-language representation, which suffers from a computational explosion when conducting cross-modal retrieval. In this work, we propose the Contrastive Cross-Modal Knowledge Sharing Pre-training (COOKIE) method to learn universal text-image representations. It has two key designs: a weight-sharing transformer on top of the visual and textual encoders to align text and image semantically, and three kinds of contrastive learning designed for sharing knowledge between different modalities. Cross-modal knowledge sharing greatly promotes the learning of unimodal representations. Experiments on multi-modal matching tasks, including cross-modal retrieval, text matching, and image retrieval, show the effectiveness and efficiency of our pre-training framework. COOKIE, fine-tuned on the cross-modal datasets MSCOCO, Flickr30K, and MSRVTT, achieves new state-of-the-art results while using only 3/1000 of the inference time of one-stream models. There are also improvements of 5.7% and 3.9% on the image retrieval and text matching tasks, respectively.
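The abstract does not spell out its three contrastive objectives, so the following is only a generic illustration of the idea behind cross-modal contrastive learning, not the COOKIE losses themselves. It is a minimal sketch of a symmetric InfoNCE loss over a batch of paired image and text embeddings, assuming cosine similarity and a temperature of 0.07; the function names, the temperature value, and the toy vectors are all assumptions made for illustration.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax_at(scores, target):
    # Numerically stable log-softmax value for the entry at index `target`.
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[target] - log_sum

def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    # Hypothetical helper: pair k of image_embs is assumed to match pair k of text_embs.
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j] = cosine similarity of image i and text j, divided by the temperature.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    # Image-to-text direction: each row should score its own text highest.
    i2t = -sum(log_softmax_at(sim[k], k) for k in range(n)) / n
    # Text-to-image direction: each column should score its own image highest.
    t2i = -sum(log_softmax_at([sim[r][k] for r in range(n)], k) for k in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy batch: correctly paired embeddings versus the same embeddings with texts swapped.
aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.1], [0.1, 1.0]])
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.1, 1.0], [1.0, 0.1]])
print(aligned < swapped)
```

Because each modality is embedded independently in a setup like this, retrieval only needs a similarity computation between precomputed vectors, which is consistent with the inference-time advantage over one-stream models that the abstract reports.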

Pages: 2188-2197

Page count: 10

[41] Learning From Expert: Vision-Language Knowledge Distillation for Unsupervised Cross-Modal Hashing Retrieval
Sun, Lina
Li, Yewen
Dong, Yumin
PROCEEDINGS OF THE 2023 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2023, 2023, : 499 - 507
[42] Multi-Modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training
Moon, Jong Hak
Lee, Hyungyung
Shin, Woncheol
Kim, Young-Hak
Choi, Edward
IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2022, 26 (12) : 6070 - 6080
[43] Cross-Modal self-supervised vision language pre-training with multiple objectives for medical visual question answering
Liu, Gang
He, Jinlong
Li, Pengfei
Zhao, Zixu
Zhong, Shenjun
Journal of Biomedical Informatics, 2024, 160
[44] Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment
Pandey, Rohan
Shao, Rulin
Liang, Paul Pu
Salakhutdinov, Ruslan
Morency, Louis-Philippe
PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 5444 - 5455
[45] Position-guided Text Prompt for Vision-Language Pre-training
Wang, Jinpeng
Zhou, Pan
Shou, Mike Zheng
Yan, Shuicheng
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23242 - 23251
[46] VLDeformer: Vision-Language Decomposed Transformer for fast cross-modal retrieval
Zhang, Lisai
Wu, Hongfa
Chen, Qingcai
Deng, Yimeng
Siebert, Joanna
Li, Zhonghua
Han, Yunpeng
Kong, Dejiang
Cao, Zhao
KNOWLEDGE-BASED SYSTEMS, 2022, 252
[47] ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation
Wang, Weihan
Yang, Zhen
Xu, Bin
Li, Juanzi
Sun, Yankui
2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 3135 - 3146
[48] Vision-Language Pre-Training: Basics, Recent Advances, and Future Trends
Gan, Zhe
Li, Linjie
Li, Chunyuan
Wang, Lijuan
Liu, Zicheng
Gao, Jianfeng
FOUNDATIONS AND TRENDS IN COMPUTER GRAPHICS AND VISION, 2022, 14 (3-4) : 163 - 352
[49] Kaleido-BERT: Vision-Language Pre-training on Fashion Domain
Zhuge, Mingchen
Gao, Dehong
Fan, Deng-Ping
Jin, Linbo
Chen, Ben
Zhou, Haoming
Qiu, Minghui
Shao, Ling
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 12642 - 12652
[50] Subsampling of Frequent Words in Text for Pre-training a Vision-Language Model
Liang, Mingliang
Larson, Martha
PROCEEDINGS OF THE 1ST WORKSHOP ON LARGE GENERATIVE MODELS MEET MULTIMODAL APPLICATIONS, LGM3A 2023, 2023, : 61 - 67
