COOKIE: Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation

Cited by: 21
Authors
Wen, Keyu [1 ]
Xia, Jin [1 ]
Huang, Yuanyuan [1 ]
Li, Linyang [2 ]
Xu, Jiayan [1 ]
Shao, Jie [1 ]
Affiliations
[1] ByteDance AI Lab, London, England
[2] Fudan Univ, Shanghai, Peoples R China
DOI
10.1109/ICCV48922.2021.00221
CLC classification: TP18 [Artificial Intelligence Theory]
Subject classification codes: 081104; 0812; 0835; 1405
Abstract
There has been a recent surge of interest in cross-modal pre-training. However, existing approaches pre-train a one-stream model to learn a joint vision-language representation, which suffers from a computational explosion when conducting cross-modal retrieval. In this work, we propose the Contrastive Cross-Modal Knowledge Sharing Pre-training (COOKIE) method to learn universal text-image representations. It has two key designs: one is a weight-sharing transformer on top of the visual and textual encoders to align text and image semantically; the other is three kinds of contrastive learning designed to share knowledge between different modalities. Cross-modal knowledge sharing greatly promotes the learning of unimodal representations. Experiments on multi-modal matching tasks including cross-modal retrieval, text matching, and image retrieval show the effectiveness and efficiency of our pre-training framework. COOKIE fine-tuned on the cross-modal datasets MSCOCO, Flickr30K, and MSRVTT achieves new state-of-the-art results while using only 3/1000 of the inference time compared to one-stream models. There are also 5.7% and 3.9% improvements on the tasks of image retrieval and text matching, respectively.
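The abstract mentions three kinds of contrastive learning without spelling them out. As a minimal sketch of the symmetric image-text InfoNCE objective that two-stream contrastive frameworks of this kind typically build on (the function names, temperature value, and batch construction here are illustrative assumptions, not details taken from the paper):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so the dot product is cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    img_emb, txt_emb: (B, D) arrays where row i of each is a matched pair.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature   # (B, B) similarity matrix
    labels = np.arange(len(logits))      # matched pairs lie on the diagonal

    def xent(lg):
        # Numerically stable cross-entropy with the diagonal as the target class.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

In a two-stream setup like COOKIE's, each encoder can be run offline and retrieval reduces to a nearest-neighbor search over precomputed embeddings, which is where the large inference-time savings over one-stream models come from.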
Pages: 2188-2197
Page count: 10
Related papers
50 records in total
  • [21] Pre-training A Prompt Pool for Vision-Language Model
    Liu, Jun
    Gu, Yang
    Yang, Zhaohua
    Guo, Shuai
    Liu, Huaqiu
    Chen, Yiqiang
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [22] UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training
    Zhou, Mingyang
    Zhou, Luowei
    Wang, Shuohang
    Cheng, Yu
    Li, Linjie
    Yu, Zhou
    Liu, Jingjing
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 4153 - 4163
  • [23] CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising*
    Luo, Jianjie
    Li, Yehao
    Pan, Yingwei
    Yao, Ting
    Chao, Hongyang
    Mei, Tao
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 5600 - 5608
  • [24] Contrastive Cross-Modal Pre-Training: A General Strategy for Small Sample Medical Imaging
    Liang, Gongbo
    Greenwell, Connor
    Zhang, Yu
    Xing, Xin
    Wang, Xiaoqin
    Kavuluru, Ramakanth
    Jacobs, Nathan
    IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2022, 26 (04) : 1640 - 1649
  • [25] Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding
    Zhang, Taolin
    He, Sunan
    Dai, Tao
    Wang, Zhi
    Chen, Bin
    Xia, Shu-Tao
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 7, 2024, : 7296 - 7304
  • [26] Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-modal Knowledge Transfer
    Jin, Woojeong
    Lee, Dong-Ho
    Zhu, Chenguang
    Pujara, Jay
    Ren, Xiang
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 2750 - 2762
  • [27] CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations
    Li, Hang
    Ding, Wenbiao
    Kang, Yu
    Liu, Tianqiao
    Wu, Zhongqin
    Liu, Zitao
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 3966 - 3977
  • [28] Vision-Language Pre-Training for Boosting Scene Text Detectors
    Song, Sibo
    Wan, Jianqiang
    Yang, Zhibo
    Tang, Jun
    Cheng, Wenqing
    Bai, Xiang
    Yao, Cong
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 15660 - 15670
  • [29] Cross-Modal Concept Learning and Inference for Vision-Language Models
    Zhang, Yi
    Zhang, Ce
    Tang, Yushun
    He, Zhihai
    NEUROCOMPUTING, 2024, 583
  • [30] Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training
    Radenovic, Filip
    Dubey, Abhimanyu
    Kadian, Abhishek
    Mihaylov, Todor
    Vandenhende, Simon
    Patel, Yash
    Wen, Yi
    Ramanathan, Vignesh
    Mahajan, Dhruv
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 6967 - 6977