COOKIE: Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation

Cited by: 21
Authors
Wen, Keyu [1 ]
Xia, Jin [1 ]
Huang, Yuanyuan [1 ]
Li, Linyang [2 ]
Xu, Jiayan [1 ]
Shao, Jie [1 ]
Affiliations
[1] ByteDance AI Lab, London, England
[2] Fudan Univ, Shanghai, Peoples R China
Keywords
DOI
10.1109/ICCV48922.2021.00221
CLC classification number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
There has been a recent surge of interest in cross-modal pre-training. However, existing approaches pre-train one-stream models to learn a joint vision-language representation, which suffer from prohibitive computation when conducting cross-modal retrieval. In this work, we propose the Contrastive Cross-Modal Knowledge Sharing Pre-training (COOKIE) method to learn universal text-image representations. It has two key designs: one is a weight-sharing transformer on top of the visual and textual encoders to align text and image semantically; the other is three kinds of contrastive learning designed to share knowledge between different modalities. Cross-modal knowledge sharing greatly promotes the learning of unimodal representations. Experiments on multi-modal matching tasks including cross-modal retrieval, text matching, and image retrieval show the effectiveness and efficiency of our pre-training framework. Fine-tuned on the cross-modal datasets MSCOCO, Flickr30K, and MSRVTT, COOKIE achieves new state-of-the-art results while using only 3/1000 of the inference time of one-stream models. It also yields improvements of 5.7% and 3.9% on image retrieval and text matching, respectively.
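The contrastive learning the abstract describes aligns paired image and text embeddings: each matched pair is pulled together while all other in-batch combinations act as negatives. A common formulation of this idea is a symmetric InfoNCE loss; the sketch below is illustrative only (it is not the paper's implementation), and the embedding values and `temperature` default are placeholder assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    img_embs[k] and txt_embs[k] form a positive pair; every other
    in-batch combination serves as a negative. Averaged over both
    the image-to-text and text-to-image directions.
    """
    n = len(img_embs)
    # temperature-scaled cosine similarity matrix
    sims = [[cosine(i, t) / temperature for t in txt_embs] for i in img_embs]
    loss = 0.0
    for k in range(n):
        # image -> text: cross-entropy of row k against the diagonal
        row = sims[k]
        loss += -row[k] + math.log(sum(math.exp(s) for s in row))
        # text -> image: cross-entropy of column k against the diagonal
        col = [sims[j][k] for j in range(n)]
        loss += -col[k] + math.log(sum(math.exp(s) for s in col))
    return loss / (2 * n)
```

With correctly paired embeddings the loss approaches zero; shuffling the pairing drives it up, which is what makes the objective a useful alignment signal.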
Pages: 2188 - 2197
Page count: 10
Related papers
50 records
  • [1] VLCDoC: Vision-Language contrastive pre-training model for cross-Modal document classification
    Bakkali, Souhail
    Ming, Zuheng
    Coustaty, Mickael
    Rusinol, Marcal
    Ramos Terrades, Oriol
    PATTERN RECOGNITION, 2023, 139
  • [2] VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix
    Wang, Teng
    Jiang, Wenhao
    Lu, Zhichao
    Zheng, Feng
    Cheng, Ran
    Yin, Chengguo
    Luo, Ping
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022
  • [3] Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation
    Jiang, Chaoya
    Ye, Wei
    Xu, Haiyang
    Huang, Songfang
    Huang, Fei
    Zhang, Shikun
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 14660 - 14679
  • [4] CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training
    Ma, Zhiyuan
    Li, Jianjun
    Li, Guohui
    Huang, Kaiyan
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4515 - 4524
  • [5] Knowledge Boosting: Rethinking Medical Contrastive Vision-Language Pre-training
    Chen, Xiaofei
    He, Yuting
    Xue, Cheng
    Ge, Rongjun
    Li, Shuo
    Yang, Guanyu
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT I, 2023, 14220 : 405 - 415
  • [6] Contrastive Vision-Language Pre-training with Limited Resources
    Cui, Quan
    Zhou, Boyan
    Guo, Yu
    Yin, Weidong
    Wu, Hao
    Yoshie, Osamu
    Chen, Yubo
    COMPUTER VISION, ECCV 2022, PT XXXVI, 2022, 13696 : 236 - 253
  • [7] Vision-Language Pre-Training with Triple Contrastive Learning
    Yang, Jinyu
    Duan, Jiali
    Tran, Son
    Xu, Yi
    Chanda, Sampath
    Chen, Liqun
    Zeng, Belinda
    Chilimbi, Trishul
    Huang, Junzhou
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 15650 - 15659
  • [8] Vision-language pre-training via modal interaction
    Cheng, Hang
    Ye, Hehui
    Zhou, Xiaofei
    Liu, Ximeng
    Chen, Fei
    Wang, Meiqing
    PATTERN RECOGNITION, 2024, 156
  • [9] PiTL: Cross-modal Retrieval with Weakly-supervised Vision-language Pre-training via Prompting
    Guo, Zixin
    Wang, Tzu-Jui Julius
    Pehlivan, Selen
    Radman, Abduljalil
    Laaksonen, Jorma
    PROCEEDINGS OF THE 46TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2023, 2023, : 2261 - 2265
  • [10] COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
    Lu, Haoyu
    Fei, Nanyi
    Huo, Yuqi
    Gao, Yizhao
    Lu, Zhiwu
    Wen, Ji-Rong
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 15671 - 15680