COOKIE: Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation

Cited by: 21
Authors:
Wen, Keyu [1 ]
Xia, Jin [1 ]
Huang, Yuanyuan [1 ]
Li, Linyang [2 ]
Xu, Jiayan [1 ]
Shao, Jie [1 ]
Affiliations:
[1] ByteDance AI Lab, London, England
[2] Fudan Univ, Shanghai, Peoples R China
Keywords:
DOI: 10.1109/ICCV48922.2021.00221
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract:
There has been a recent surge of interest in cross-modal pre-training. However, existing approaches pre-train a one-stream model to learn a joint vision-language representation, which suffers from a computational explosion when conducting cross-modal retrieval. In this work, we propose the Contrastive Cross-Modal Knowledge Sharing Pre-training (COOKIE) method to learn universal text-image representations. It has two key designs: a weight-sharing transformer on top of the visual and textual encoders that aligns text and images semantically, and three kinds of contrastive learning designed to share knowledge between different modalities. Cross-modal knowledge sharing greatly promotes the learning of unimodal representations. Experiments on multi-modal matching tasks, including cross-modal retrieval, text matching, and image retrieval, show the effectiveness and efficiency of our pre-training framework. Fine-tuned on the cross-modal datasets MSCOCO, Flickr30K, and MSRVTT, COOKIE achieves new state-of-the-art results while using only 3/1000 of the inference time of one-stream models. It also yields 5.7% and 3.9% improvements on image retrieval and text matching, respectively.
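The two key designs in the abstract lend themselves to a compact illustration. Below is a minimal PyTorch sketch of the general idea: a single transformer layer whose weights are shared between the visual and textual streams, trained with a symmetric image-text contrastive (InfoNCE) loss. Every detail here (module names, dimensions, the use of one shared layer, and a single cross-modal loss standing in for the paper's three contrastive objectives) is an illustrative assumption, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedHead(nn.Module):
    """One transformer layer whose weights are shared by both modalities."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)

    def forward(self, tokens):
        # The same parameters process visual tokens and textual tokens,
        # nudging the two modalities toward a common semantic space.
        return self.layer(tokens).mean(dim=1)  # mean-pooled embedding

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-modal contrastive loss (InfoNCE)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched image-text pairs on the diagonal are positives;
    # every other pair in the batch serves as a negative.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage: pretend unimodal encoders already produced token sequences.
head = SharedHead()
img_tokens = torch.randn(4, 36, 512)  # e.g., 36 region features per image
txt_tokens = torch.randn(4, 20, 512)  # e.g., 20 word features per caption
loss = info_nce(head(img_tokens), head(txt_tokens))
loss.backward()

Because each modality is encoded independently, retrieval reduces to a nearest-neighbor search over precomputed embeddings; this is the property behind the abstract's claim of roughly 3/1000 of the inference time of one-stream models, which must re-run a joint encoder for every query-candidate pair.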
Pages: 2188-2197 (10 pages)
Related Papers (50 in total):
  • [41] Learning From Expert: Vision-Language Knowledge Distillation for Unsupervised Cross-Modal Hashing Retrieval
    Sun, Lina
    Li, Yewen
    Dong, Yumin
    PROCEEDINGS OF THE 2023 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2023, 2023, : 499 - 507
  • [42] Multi-Modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training
    Moon, Jong Hak
    Lee, Hyungyung
    Shin, Woncheol
    Kim, Young-Hak
    Choi, Edward
    IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2022, 26 (12) : 6070 - 6080
  • [43] Cross-Modal self-supervised vision language pre-training with multiple objectives for medical visual question answering
    Liu, Gang
    He, Jinlong
    Li, Pengfei
    Zhao, Zixu
    Zhong, Shenjun
    JOURNAL OF BIOMEDICAL INFORMATICS, 2024, 160
  • [44] Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment
    Pandey, Rohan
    Shao, Rulin
    Liang, Paul Pu
    Salakhutdinov, Ruslan
    Morency, Louis-Philippe
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 5444 - 5455
  • [45] Position-guided Text Prompt for Vision-Language Pre-training
    Wang, Jinpeng
    Zhou, Pan
    Shou, Mike Zheng
    Yan, Shuicheng
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23242 - 23251
  • [46] VLDeformer: Vision-Language Decomposed Transformer for fast cross-modal retrieval
    Zhang, Lisai
    Wu, Hongfa
    Chen, Qingcai
    Deng, Yimeng
    Siebert, Joanna
    Li, Zhonghua
    Han, Yunpeng
    Kong, Dejiang
    Cao, Zhao
    KNOWLEDGE-BASED SYSTEMS, 2022, 252
  • [47] ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation
    Wang, Weihan
    Yang, Zhen
    Xu, Bin
    Li, Juanzi
    Sun, Yankui
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 3135 - 3146
  • [48] Vision-Language Pre-Training: Basics, Recent Advances, and Future Trends
    Gan, Zhe
    Li, Linjie
    Li, Chunyuan
    Wang, Lijuan
    Liu, Zicheng
    Gao, Jianfeng
    FOUNDATIONS AND TRENDS IN COMPUTER GRAPHICS AND VISION, 2022, 14 (3-4): : 163 - 352
  • [49] Kaleido-BERT: Vision-Language Pre-training on Fashion Domain
    Zhuge, Mingchen
    Gao, Dehong
    Fan, Deng-Ping
    Jin, Linbo
    Chen, Ben
    Zhou, Haoming
    Qiu, Minghui
    Shao, Ling
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 12642 - 12652
  • [50] Subsampling of Frequent Words in Text for Pre-training a Vision-Language Model
    Liang, Mingliang
    Larson, Martha
    PROCEEDINGS OF THE 1ST WORKSHOP ON LARGE GENERATIVE MODELS MEET MULTIMODAL APPLICATIONS, LGM3A 2023, 2023, : 61 - 67