COOKIE: Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation

Cited by: 21
Authors
Wen, Keyu [1 ]
Xia, Jin [1 ]
Huang, Yuanyuan [1 ]
Li, Linyang [2 ]
Xu, Jiayan [1 ]
Shao, Jie [1 ]
Affiliations
[1] ByteDance AI Lab, London, England
[2] Fudan Univ, Shanghai, Peoples R China
DOI
10.1109/ICCV48922.2021.00221
CLC classification: TP18 [Artificial Intelligence Theory]
Subject classification codes: 081104; 0812; 0835; 1405
Abstract
There has been a recent surge of interest in cross-modal pre-training. However, existing approaches pre-train a one-stream model to learn a joint vision-language representation, which suffers from a computational explosion when conducting cross-modal retrieval. In this work, we propose the Contrastive Cross-Modal Knowledge Sharing Pre-training (COOKIE) method to learn universal text-image representations. It has two key designs: one is a weight-sharing transformer on top of the visual and textual encoders to align text and image semantically; the other is three kinds of contrastive learning designed to share knowledge between different modalities. Cross-modal knowledge sharing greatly promotes the learning of unimodal representations. Experiments on multi-modal matching tasks including cross-modal retrieval, text matching, and image retrieval show the effectiveness and efficiency of our pre-training framework. COOKIE fine-tuned on the cross-modal datasets MSCOCO, Flickr30K, and MSRVTT achieves new state-of-the-art results while using only 3/1000 of the inference time compared to one-stream models. There are also 5.7% and 3.9% improvements on the tasks of image retrieval and text matching, respectively.
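The abstract mentions three kinds of contrastive learning without spelling them out. As a minimal sketch of the symmetric image-text InfoNCE objective that two-stream contrastive frameworks of this kind typically build on (the function names, temperature value, and batch construction here are illustrative assumptions, not details taken from the paper):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so the dot product is cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    img_emb, txt_emb: (B, D) arrays where row i of each is a matched pair.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature   # (B, B) similarity matrix
    labels = np.arange(len(logits))      # matched pairs lie on the diagonal

    def xent(lg):
        # Numerically stable cross-entropy with the diagonal as the target class.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

In a two-stream setup like COOKIE's, each encoder can be run offline and retrieval reduces to a nearest-neighbor search over precomputed embeddings, which is where the large inference-time savings over one-stream models come from.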
Pages: 2188-2197
Page count: 10
Related papers
50 records in total
  • [21] Pre-training A Prompt Pool for Vision-Language Model
    Liu, Jun
    Gu, Yang
    Yang, Zhaohua
    Guo, Shuai
    Liu, Huaqiu
    Chen, Yiqiang
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [22] UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training
    Zhou, Mingyang
    Zhou, Luowei
    Wang, Shuohang
    Cheng, Yu
    Li, Linjie
    Yu, Zhou
    Liu, Jingjing
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 4153 - 4163
  • [23] CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising*
    Luo, Jianjie
    Li, Yehao
    Pan, Yingwei
    Yao, Ting
    Chao, Hongyang
    Mei, Tao
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 5600 - 5608
  • [24] Contrastive Cross-Modal Pre-Training: A General Strategy for Small Sample Medical Imaging
    Liang, Gongbo
    Greenwell, Connor
    Zhang, Yu
    Xing, Xin
    Wang, Xiaoqin
    Kavuluru, Ramakanth
    Jacobs, Nathan
    IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2022, 26 (04) : 1640 - 1649
  • [25] Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding
    Zhang, Taolin
    He, Sunan
    Dai, Tao
    Wang, Zhi
    Chen, Bin
    Xia, Shu-Tao
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 7, 2024, : 7296 - 7304
  • [26] Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-modal Knowledge Transfer
    Jin, Woojeong
    Lee, Dong-Ho
    Zhu, Chenguang
    Pujara, Jay
    Ren, Xiang
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 2750 - 2762
  • [27] CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations
    Li, Hang
    Ding, Wenbiao
    Kang, Yu
    Liu, Tianqiao
    Wu, Zhongqin
    Liu, Zitao
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 3966 - 3977
  • [28] Vision-Language Pre-Training for Boosting Scene Text Detectors
    Song, Sibo
    Wan, Jianqiang
    Yang, Zhibo
    Tang, Jun
    Cheng, Wenqing
    Bai, Xiang
    Yao, Cong
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 15660 - 15670
  • [29] Cross-Modal Concept Learning and Inference for Vision-Language Models
    Zhang, Yi
    Zhang, Ce
    Tang, Yushun
    He, Zhihai
    NEUROCOMPUTING, 2024, 583
  • [30] Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training
    Radenovic, Filip
    Dubey, Abhimanyu
    Kadian, Abhishek
    Mihaylov, Todor
    Vandenhende, Simon
    Patel, Yash
    Wen, Yi
    Ramanathan, Vignesh
    Mahajan, Dhruv
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 6967 - 6977