Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training

Cited by: 7
Authors:
Radenovic, Filip [1 ]
Dubey, Abhimanyu [1 ]
Kadian, Abhishek [1 ]
Mihaylov, Todor [1 ]
Vandenhende, Simon [1 ]
Patel, Yash [2 ]
Wen, Yi [1 ]
Ramanathan, Vignesh [1 ]
Mahajan, Dhruv [1 ]
Affiliations:
[1] Meta AI, New York, NY 10003 USA
[2] Czech Tech Univ, Prague, Czech Republic
DOI:
10.1109/CVPR52729.2023.00673
CLC classification: TP18 (Artificial Intelligence Theory)
Discipline codes: 081104; 0812; 0835; 1405
Abstract:
Vision-language models trained with contrastive learning on large-scale noisy data are becoming increasingly popular for zero-shot recognition problems. In this paper we improve three aspects of the contrastive pre-training pipeline: dataset noise, model initialization, and the training objective. First, we propose a straightforward filtering strategy, Complexity, Action, and Text-spotting (CAT), that significantly reduces dataset size while improving performance across zero-shot vision-language tasks. Next, we propose Concept Distillation, which leverages strong unimodal representations for contrastive training without increasing training complexity, while outperforming prior work. Finally, we modify the traditional contrastive alignment objective with an importance-sampling approach that up-weights hard negatives without adding complexity. On an extensive zero-shot benchmark of 29 tasks, our Distilled and Hard-negative Training (DiHT) approach improves on 20 tasks over the baseline. Furthermore, for few-shot linear probing, we propose a novel approach that bridges the gap between zero-shot and few-shot performance, substantially improving over prior work. Models are available at github.com/facebookresearch/diht.
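The hard-negative up-weighting described in the abstract can be sketched as a re-weighted InfoNCE loss, where each negative's contribution to the softmax denominator is scaled by an importance weight that grows with its similarity to the anchor. The following NumPy sketch is illustrative only, not the paper's exact DiHT objective; the softmax-based weight function and the `beta` concentration parameter are assumptions (with `beta=0` it reduces to plain InfoNCE).

```python
import numpy as np

def _weighted_infonce(logits, beta):
    """One direction of InfoNCE with importance-weighted negatives.

    Harder negatives (higher logits) get weights > 1 in the softmax
    denominator; the positive (diagonal) keeps unit weight.
    """
    B = logits.shape[0]
    eye = np.eye(B, dtype=bool)
    # softmax(beta * logit) over each row's negatives, rescaled so the
    # average negative weight is 1 (beta=0 recovers plain InfoNCE)
    s = np.where(eye, -np.inf, beta * logits)
    s -= s.max(axis=1, keepdims=True)
    w = np.exp(s)
    w = w / w.sum(axis=1, keepdims=True) * (B - 1)
    w[eye] = 1.0
    # numerically stable log-sum-exp of the weighted denominator
    z = logits + np.log(w)
    m = z.max(axis=1, keepdims=True)
    lse = np.log(np.exp(z - m).sum(axis=1)) + m[:, 0]
    return float(np.mean(lse - np.diag(logits)))

def hard_negative_contrastive_loss(img_emb, txt_emb, temperature=0.07, beta=0.5):
    """Symmetric image-text contrastive loss with up-weighted hard negatives."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    return 0.5 * (_weighted_infonce(logits, beta) + _weighted_infonce(logits.T, beta))
```

Because the re-weighting only rescales terms inside the denominator, the loss keeps the same computational cost as a standard contrastive objective, matching the abstract's claim of adding no extra complexity.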
Pages: 6967-6977 (11 pages)