Scaling Language-Image Pre-training via Masking

Cited by: 17
Authors:
Li, Yanghao [1 ]
Fan, Haoqi [1 ]
Hu, Ronghang [1 ]
Feichtenhofer, Christoph [1]
He, Kaiming [1 ]
Affiliations:
[1] Meta AI, FAIR, New York, NY 10023 USA
DOI: 10.1109/CVPR52729.2023.02240
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline Classification Codes: 081104; 0812; 0835; 1405
Abstract:
We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP [52]. Our method randomly masks out and removes a large portion of image patches during training. Masking allows us to learn from more image-text pairs given the same wall-clock time and contrast more samples per iteration with similar memory footprint. It leads to a favorable trade-off between accuracy and training time. In our experiments on 400 million image-text pairs, FLIP improves both accuracy and speed over the no-masking baseline. On a large diversity of downstream tasks, FLIP dominantly outperforms the CLIP counterparts trained on the same data. Facilitated by the speedup, we explore the scaling behavior of increasing the model size, data size, or training length, and report encouraging results and comparisons. We hope that our work will foster future research on scaling vision-language learning.
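The core mechanism is simple enough to sketch. Below is a minimal, illustrative PyTorch snippet of the random patch masking the abstract describes, assuming a ViT-style image encoder that consumes a sequence of patch embeddings; the function name, signature, and default ratio are assumptions for illustration, not the authors' released code. The paper explores high masking ratios (e.g., 50-75%).

```python
# A minimal sketch of FLIP-style random patch masking (assumed names;
# not the authors' released code). Masked patches are removed entirely,
# so the image encoder only processes the visible subset.
import torch

def random_mask(patches: torch.Tensor, mask_ratio: float = 0.5) -> torch.Tensor:
    """Keep a random per-sample subset of patch embeddings.

    patches: (batch, num_patches, dim) tensor of patch embeddings.
    Returns the visible patches, shape (batch, num_keep, dim).
    """
    b, n, d = patches.shape
    num_keep = int(n * (1.0 - mask_ratio))
    # Argsorting uniform noise gives an independent random permutation
    # of patch indices for every sample in the batch.
    noise = torch.rand(b, n, device=patches.device)
    keep_idx = noise.argsort(dim=1)[:, :num_keep]          # (b, num_keep)
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, d)    # (b, num_keep, d)
    return torch.gather(patches, dim=1, index=keep_idx)

# Example: with mask_ratio=0.5 the encoder sees half the tokens.
x = torch.randn(8, 196, 768)     # 8 images, 14x14 patches, width 768
visible = random_mask(x, 0.5)    # -> (8, 98, 768)
```

Because masked patches are removed rather than replaced with mask tokens, encoder compute and memory fall roughly in proportion to the masking ratio, which is what frees the budget for more image-text pairs per unit of wall-clock time and more negatives per contrastive batch.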
Pages: 23390-23400 (11 pages)
Related Papers (50 in total; entries [41]-[50] listed below):
  • [41] Lewis, Mike; Ghazvininejad, Marjan; Ghosh, Gargi; Aghajanyan, Armen; Wang, Sida; Zettlemoyer, Luke. Pre-training via Paraphrasing. Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020, 33.
  • [42] Wen, Zhoufutu; Zhao, Xinyu; Jin, Zhipeng; Yang, Yi; Jia, Wei; Chen, Xiaodong; Li, Shuanglong; Liu, Lin. Enhancing Dynamic Image Advertising with Vision-Language Pre-training. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023), 2023: 3310-3314.
  • [43] Zhou, Luowei; Palangi, Hamid; Zhang, Lei; Hu, Houdong; Corso, Jason J.; Gao, Jianfeng. Unified Vision-Language Pre-Training for Image Captioning and VQA. Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020), 2020, 34: 13041-13049.
  • [44] Cui, Yiming; Che, Wanxiang; Liu, Ting; Qin, Bing; Yang, Ziqing. Pre-Training With Whole Word Masking for Chinese BERT. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3504-3514.
  • [45] Li, Yian; Zhao, Hai. Pre-training Universal Language Representation. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021), Vol. 1, 2021: 5122-5133.
  • [46] Chi, Zewen; Dong, Li; Wei, Furu; Wang, Wenhui; Mao, Xian-Ling; Huang, Heyan. Cross-Lingual Natural Language Generation via Pre-Training. Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020), 2020, 34: 7570-7577.
  • [47] Li, M.; Zhang, L.; Zhu, M.; Huang, Z.; Yu, G.; Fan, J.; Chen, T. Lightweight Model Pre-Training via Language Guided Knowledge Distillation. IEEE Transactions on Multimedia, 2024, 26: 1-11.
  • [48] Lei, Chenyi; Luo, Shixian; Liu, Yong; He, Wanggui; Wang, Jiamang; Wang, Guoxin; Tang, Haihong; Miao, Chunyan; Li, Houqiang. Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training. Proceedings of the 29th ACM International Conference on Multimedia (MM 2021), 2021: 2567-2576.
  • [49] Jin, Zhao; Hayat, Munawar; Yang, Yuwei; Guo, Yulan; Lei, Yinjie. Context-aware Alignment and Mutual Masking for 3D-Language Pre-training. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023: 10984-10994.
  • [50] Wu, Junru; Liang, Yi; Han, Feng; Akbari, Hassan; Wang, Zhangyang; Yu, Cong. Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization. Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.