Scaling Language-Image Pre-training via Masking

Cited by: 17
Authors
Li, Yanghao [1]
Fan, Haoqi [1]
Hu, Ronghang [1]
Feichtenhofer, Christoph [1]
He, Kaiming [1]
Affiliations
[1] Meta AI, FAIR, New York, NY 10023 USA
DOI
10.1109/CVPR52729.2023.02240
CLC classification
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104; 0812; 0835; 1405
Abstract
We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP [52]. Our method randomly masks out and removes a large portion of image patches during training. Masking allows us to learn from more image-text pairs given the same wall-clock time and contrast more samples per iteration with similar memory footprint. It leads to a favorable trade-off between accuracy and training time. In our experiments on 400 million image-text pairs, FLIP improves both accuracy and speed over the no-masking baseline. On a large diversity of downstream tasks, FLIP dominantly outperforms the CLIP counterparts trained on the same data. Facilitated by the speedup, we explore the scaling behavior of increasing the model size, data size, or training length, and report encouraging results and comparisons. We hope that our work will foster future research on scaling vision-language learning.
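The masking-and-contrast recipe described in the abstract can be made concrete with a short sketch. The PyTorch code below is a minimal illustration under stated assumptions, not the authors' released implementation: it assumes MAE-style random dropping of ViT patch tokens followed by a standard CLIP-style symmetric InfoNCE loss, and all names (random_mask_patches, clip_contrastive_loss, mask_ratio, temperature) are hypothetical.

```python
import torch
import torch.nn.functional as F

def random_mask_patches(patch_tokens, mask_ratio=0.5):
    """Keep a random subset of ViT patch tokens (MAE-style masking, assumed here).

    patch_tokens: (batch, num_patches, dim) tensor of embedded image patches.
    Returns only the kept tokens: (batch, round-down of num_patches * (1 - mask_ratio), dim).
    """
    N, L, D = patch_tokens.shape
    len_keep = int(L * (1 - mask_ratio))
    noise = torch.rand(N, L, device=patch_tokens.device)           # one random score per patch
    ids_keep = torch.argsort(noise, dim=1)[:, :len_keep]           # random subset per sample
    return torch.gather(patch_tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE loss over L2-normalized image/text embeddings (CLIP-style)."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = image_features @ text_features.t() / temperature      # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)   # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

With mask_ratio=0.5 the image encoder processes roughly half the tokens per image, so the batch size, and hence the number of pairs contrasted per iteration, can be increased at a similar memory footprint; this is the accuracy/training-time trade-off the abstract describes.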
Pages: 23390 - 23400
Page count: 11
Related papers
50 items in total
  • [1] Grounded Language-Image Pre-training
    Li, Liunian Harold
    Zhang, Pengchuan
    Zhang, Haotian
    Yang, Jianwei
    Li, Chunyuan
    Zhong, Yiwu
    Wang, Lijuan
    Yuan, Lu
    Zhang, Lei
    Hwang, Jenq-Neng
    Chang, Kai-Wei
    Gao, Jianfeng
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 10955 - 10965
  • [2] RLIPv2: Fast Scaling of Relational Language-Image Pre-training
    Yuan, Hangjie
    Zhang, Shiwei
    Wang, Xiang
    Albanie, Samuel
    Pan, Yining
    Feng, Tao
    Jiang, Jianwen
    Ni, Dong
    Zhang, Yingya
    Zhao, Deli
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 21592 - 21604
  • [3] Contrastive Language-Image Pre-Training with Knowledge Graphs
    Pan, Xuran
    Ye, Tianzhu
    Han, Dongchen
    Song, Shiji
    Huang, Gao
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [4] ALIP: Adaptive Language-Image Pre-training with Synthetic Caption
    Yang, Kaicheng
    Deng, Jiankang
    An, Xiang
    Li, Jiawei
    Feng, Ziyong
    Guo, Jia
    Yang, Jing
    Liu, Tongliang
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 2910 - 2919
  • [5] UniCLIP: Unified Framework for Contrastive Language-Image Pre-training
    Lee, Janghyeon
    Kim, Jongsuk
    Shon, Hyounguk
    Kim, Bumsoo
    Kim, Seung Hwan
    Lee, Honglak
    Kim, Junmo
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [6] NLIP: Noise-Robust Language-Image Pre-training
    Huang, Runhui
    Long, Yanxin
    Han, Jianhua
    Xu, Hang
    Liang, Xiwen
    Xu, Chunjing
    Liang, Xiaodan
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 1, 2023, : 926 - 934
  • [7] Non-Contrastive Learning Meets Language-Image Pre-Training
    Zhou, Jinghao
    Dong, Li
    Gan, Zhe
    Wang, Lijuan
    Wei, Furu
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 11028 - 11038
  • [8] iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-training for Visual Recognition
    Wei, Yixuan
    Cao, Yue
    Zhang, Zheng
    Peng, Houwen
    Yao, Zhuliang
    Xie, Zhenda
    Hu, Han
    Guo, Baining
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 2776 - 2786
  • [9] SLIP: Self-supervision Meets Language-Image Pre-training
    Mu, Norman
    Kirillov, Alexander
    Wagner, David
    Xie, Saining
    COMPUTER VISION, ECCV 2022, PT XXVI, 2022, 13686 : 529 - 544