Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training

Cited by: 7
Authors
Radenovic, Filip [1 ]
Dubey, Abhimanyu [1 ]
Kadian, Abhishek [1 ]
Mihaylov, Todor [1 ]
Vandenhende, Simon [1 ]
Patel, Yash [2 ]
Wen, Yi [1 ]
Ramanathan, Vignesh [1 ]
Mahajan, Dhruv [1 ]
Affiliations
[1] Meta AI, New York, NY 10003 USA
[2] Czech Tech Univ, Prague, Czech Republic
DOI
10.1109/CVPR52729.2023.00673
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Vision-language models trained with contrastive learning on large-scale noisy data are becoming increasingly popular for zero-shot recognition problems. In this paper we improve three aspects of the contrastive pre-training pipeline: dataset noise, model initialization, and the training objective. First, we propose a straightforward filtering strategy titled Complexity, Action, and Text-spotting (CAT) that significantly reduces dataset size while improving performance across zero-shot vision-language tasks. Next, we propose an approach titled Concept Distillation that leverages strong unimodal representations for contrastive training without increasing training complexity, outperforming prior work. Finally, we modify the traditional contrastive alignment objective with an importance-sampling approach that up-weights hard negatives without adding complexity. On an extensive zero-shot benchmark of 29 tasks, our Distilled and Hard-negative Training (DiHT) approach improves on 20 tasks compared to the baseline. Furthermore, for few-shot linear probing, we propose a novel approach that bridges the gap between zero-shot and few-shot performance, substantially improving over prior work. Models are available at github.com/facebookresearch/diht.
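Since the abstract only names the modified objective, a brief illustration may help. Below is a minimal PyTorch sketch, under assumed details, of one direction of a CLIP-style contrastive loss in which in-batch negatives are up-weighted by their similarity to the anchor. The weighting form w_ij ∝ exp(β·s_ij), the hyperparameters `tau` and `beta`, and the function name are illustrative assumptions, not the exact DiHT formulation.

```python
# Hypothetical sketch (not the paper's exact objective): one direction of a
# CLIP-style contrastive loss where negatives are re-weighted by an
# importance term w_ij ~ exp(beta * s_ij), so harder negatives (higher
# similarity to the anchor) contribute more to the denominator.
import torch
import torch.nn.functional as F

def hard_negative_contrastive_loss(image_emb, text_emb, tau=0.07, beta=0.5):
    # Cosine similarities scaled by temperature: sim[i, j] pairs image i with text j.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sim = image_emb @ text_emb.t() / tau                        # (B, B)

    batch = sim.size(0)
    pos = sim.diag()                                            # matched pairs
    neg_mask = ~torch.eye(batch, dtype=torch.bool, device=sim.device)

    # Importance weights over each row's negatives; detached so they act as
    # fixed sampling weights rather than an extra gradient path.
    logits = (beta * sim).masked_fill(~neg_mask, float("-inf"))
    w = torch.softmax(logits, dim=1).detach()
    w = w * (batch - 1)                                         # beta = 0 recovers w = 1 (plain InfoNCE)

    # Weighted InfoNCE: -log( e^{s_ii} / (e^{s_ii} + sum_{j != i} w_ij e^{s_ij}) )
    neg_term = (w * sim.exp()).sum(dim=1)                       # diagonal has w = 0
    return (torch.log(pos.exp() + neg_term) - pos).mean()

# Usage with random features; a real model would supply encoder outputs.
images, texts = torch.randn(8, 512), torch.randn(8, 512)
print(hard_negative_contrastive_loss(images, texts))
```

Because the weights are detached and computed from the similarity matrix the loss already needs, the only extra cost is a softmax over that matrix, which is consistent with the abstract's claim that the up-weighting adds no training complexity.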
Pages: 6967-6977
Page count: 11