Vision-Language Pre-Training with Triple Contrastive Learning

Cited by: 61
Authors
Yang, Jinyu [1 ,2 ]
Duan, Jiali [2 ]
Tran, Son [2 ]
Xu, Yi [2 ]
Chanda, Sampath [2 ]
Chen, Liqun [2 ]
Zeng, Belinda [2 ]
Chilimbi, Trishul [2 ]
Huang, Junzhou [1 ]
Affiliations
[1] Univ Texas Arlington, Arlington, TX 76019 USA
[2] Amazon, Seattle, WA USA
Funding
U.S. National Science Foundation
DOI
10.1109/CVPR52688.2022.01522
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Vision-language representation learning largely benefits from image-text alignment through contrastive losses (e.g., the InfoNCE loss). The success of this alignment strategy is attributed to its capability to maximize the mutual information (MI) between an image and its matched text. However, simply performing cross-modal alignment (CMA) ignores the data potential within each modality, which may result in degraded representations. For instance, although CMA-based models are able to map image-text pairs close together in the embedding space, they fail to ensure that similar inputs from the same modality stay close by. This problem can get even worse when the pre-training data is noisy. In this paper, we propose triple contrastive learning (TCL) for vision-language pre-training by leveraging both cross-modal and intra-modal self-supervision. Besides CMA, TCL introduces an intra-modal contrastive objective to provide complementary benefits in representation learning. To take advantage of localized and structural information from image and text input, TCL further maximizes the average MI between local regions of image/text and their global summary. To the best of our knowledge, ours is the first work that takes into account local structure information for multi-modality representation learning. Experimental evaluations show that our approach is competitive and achieves the new state of the art on various common downstream vision-language tasks such as image-text retrieval and visual question answering.
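To make the three objectives named in the abstract concrete, the following is a minimal PyTorch-style sketch, not the authors' released implementation. It assumes L2-normalized global embeddings from separate image and text encoders, augmented views for the intra-modal term, and mean-pooled patch/token features as a crude stand-in for "local regions"; all names (`info_nce`, `tcl_loss`, `tau`) are illustrative.

```python
# Hypothetical sketch of TCL's three contrastive objectives:
# cross-modal alignment (CMA), intra-modal contrastive (IMC),
# and local mutual-information (LMI). Names are illustrative.
import torch
import torch.nn.functional as F

def info_nce(query: torch.Tensor, keys: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE: query[i] should match keys[i] against in-batch negatives."""
    logits = query @ keys.t() / tau                          # (B, B) similarities
    targets = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, targets)

def tcl_loss(img_g, txt_g, img_g_aug, txt_g_aug, img_local, txt_local, tau=0.07):
    # *_g: L2-normalized global [CLS] embeddings, shape (B, D);
    # *_g_aug: embeddings of augmented views; *_local: patch/token features (B, N, D).
    # 1) CMA: pull matched image-text pairs together across modalities.
    cma = 0.5 * (info_nce(img_g, txt_g, tau) + info_nce(txt_g, img_g, tau))
    # 2) IMC: each input agrees with its own augmented view, so similar
    #    same-modality inputs stay close in the embedding space.
    imc = 0.5 * (info_nce(img_g, img_g_aug, tau) + info_nce(txt_g, txt_g_aug, tau))
    # 3) LMI: maximize agreement between local regions and the global summary;
    #    mean pooling over local features is a simplification in this sketch.
    img_l = F.normalize(img_local.mean(dim=1), dim=-1)
    txt_l = F.normalize(txt_local.mean(dim=1), dim=-1)
    lmi = 0.5 * (info_nce(img_l, img_g, tau) + info_nce(txt_l, txt_g, tau))
    return cma + imc + lmi
```

All three terms are plain InfoNCE instances that differ only in what plays the roles of query and key, which is why a single helper suffices in this sketch; the paper's actual formulation of the local MI term is more involved than the pooling shortcut used here.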
Pages: 15650-15659
Page count: 10
Related Papers (50 total)
  • [1] Cui, Quan; Zhou, Boyan; Guo, Yu; Yin, Weidong; Wu, Hao; Yoshie, Osamu; Chen, Yubo. Contrastive Vision-Language Pre-training with Limited Resources. COMPUTER VISION, ECCV 2022, PT XXXVI, 2022, 13696: 236-253.
  • [2] Chen, Xiaofei; He, Yuting; Xue, Cheng; Ge, Rongjun; Li, Shuo; Yang, Guanyu. Knowledge Boosting: Rethinking Medical Contrastive Vision-Language Pre-training. MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT I, 2023, 14220: 405-415.
  • [3] Jian, Yiren; Gao, Chongyang; Vosoughi, Soroush. Bootstrapping Vision-Language Learning with Decoupled Language Pre-training. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023.
  • [4] Yin, Jiong; Zhang, Zhe-Dong; Gao, Yu-Han; Yang, Zhi-Wen; Li, Liang; Xiao, Mang; Sun, Yao-Qi; Yan, Cheng-Gang. Survey on Vision-language Pre-training. Ruan Jian Xue Bao/Journal of Software, 2023, 34 (05): 2000-2023.
  • [5] Zhang, Taolin; He, Sunan; Dai, Tao; Wang, Zhi; Chen, Bin; Xia, Shu-Tao. Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 7, 2024: 7296-7304.
  • [6] Wang, Tzu-Jui Julius; Laaksonen, Jorma; Langer, Tomas; Arponen, Heikki; Bishop, Tom E. Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision. 2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023: 1073-1083.
  • [7] Chen, Fei-Long; Zhang, Du-Zhen; Han, Ming-Lun; Chen, Xiu-Yi; Shi, Jing; Xu, Shuang; Xu, Bo. VLP: A Survey on Vision-language Pre-training. MACHINE INTELLIGENCE RESEARCH, 2023, 20 (01): 38-56.
  • [8] Liu, Jun; Gu, Yang; Yang, Zhaohua; Guo, Shuai; Liu, Huaqiu; Chen, Yiqiang. Pre-training A Prompt Pool for Vision-Language Model. 2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023.