D4: Improving LLM Pretraining via Document De-Duplication and Diversification

Cited by: 0
Authors
Tirumala, Kushal [1]
Simig, Daniel [1]
Aghajanyan, Armen [1]
Morcos, Ari S. [1]
Affiliation
[1] Meta AI Res, New York, NY 10031 USA
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Over recent years, an increasing amount of compute and data has been poured into training large language models (LLMs), usually by doing one-pass learning on as many tokens as possible randomly selected from large-scale web corpora. While training on ever-larger portions of the internet leads to consistent performance improvements, the size of these improvements diminishes with scale, and there has been little work exploring the effect of data selection on pre-training and downstream performance beyond simple de-duplication methods such as MinHash. Here, we show that careful data selection (on top of de-duplicated data) via pre-trained model embeddings can speed up training (20% efficiency gains) and improve average downstream accuracy on 16 NLP tasks (up to 2%) at the 6.7B model scale. Furthermore, we show that repeating data intelligently consistently outperforms baseline training (while repeating random data performs worse than baseline training). Our results indicate that clever data selection can significantly improve LLM pre-training, call into question the common practice of training for a single epoch on as much data as possible, and demonstrate a path to keep improving our models past the limits of randomly sampling web data.
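For readers unfamiliar with the "simple de-duplication" baseline the abstract refers to, the following is a minimal sketch of MinHash-based near-duplicate filtering. It is an illustration only, not the authors' pipeline: the function names, the 3-word shingle size, the 64 hash seeds, and the 0.8 similarity threshold are all assumptions chosen for the example, and the pairwise comparison loop is quadratic (real systems typically add locality-sensitive hashing on top).

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text (k is an assumed setting)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_perm=64, k=3):
    """MinHash signature: for each seed, the minimum hash over all shingles."""
    items = shingles(text, k)
    if not items:
        return None
    return tuple(min(_seeded_hash(seed, s) for s in items) for seed in range(num_perm))


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def deduplicate(docs, threshold=0.8, num_perm=64, k=3):
    """Keep the first document of each near-duplicate group (hypothetical helper)."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc, num_perm, k)
        if sig is None:
            continue  # documents with no word tokens are dropped in this sketch
        if any(estimated_jaccard(sig, other) >= threshold for other in kept_sigs):
            continue  # too similar to a document that was already kept
        kept.append(doc)
        kept_sigs.append(sig)
    return kept


docs = [
    "the cat sat on the mat today",
    "the cat sat on the mat today",
    "a completely different sentence about training language models",
]
unique_docs = deduplicate(docs)
```

Here `unique_docs` keeps one copy of the repeated sentence plus the unrelated one. The embedding-based selection the abstract describes operates on top of data filtered in this way; its details are not given in this record and are not sketched here.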
Pages: 13