D4: Improving LLM Pretraining via Document De-Duplication and Diversification

Cited by: 0
Authors
Tirumala, Kushal [1]
Simig, Daniel [1]
Aghajanyan, Armen [1]
Morcos, Ari S. [1]
Affiliations
[1] Meta AI Res, New York, NY 10031 USA
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Over recent years, an increasing amount of compute and data has been poured into training large language models (LLMs), usually by doing one-pass learning on as many tokens as possible randomly selected from large-scale web corpora. While training on ever-larger portions of the internet leads to consistent performance improvements, the size of these improvements diminishes with scale, and there has been little work exploring the effect of data selection on pre-training and downstream performance beyond simple de-duplication methods such as MinHash. Here, we show that careful data selection (on top of de-duplicated data) via pre-trained model embeddings can speed up training (20% efficiency gains) and improve average downstream accuracy on 16 NLP tasks (up to 2%) at the 6.7B model scale. Furthermore, we show that repeating data intelligently consistently outperforms baseline training (while repeating random data performs worse than baseline training). Our results indicate that clever data selection can significantly improve LLM pre-training, call into question the common practice of training for a single epoch on as much data as possible, and demonstrate a path to keep improving our models past the limits of randomly sampling web data.
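The abstract does not spell out the selection procedure, so the following is only a minimal sketch of the general idea of filtering documents by embedding similarity, not the authors' method. It assumes each document already has an embedding vector from some pre-trained model; the function names `cosine` and `select_diverse`, the toy vectors, and the 0.95 threshold are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_diverse(embeddings, threshold=0.95):
    """Greedy filter (illustrative assumption, not the paper's algorithm):
    keep a document only if its similarity to every already-kept
    document is below the threshold."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy 2-D "embeddings"; real ones would come from a pre-trained model.
toy_embeddings = [
    [1.0, 0.0],    # document 0
    [0.99, 0.05],  # near-duplicate of document 0
    [0.0, 1.0],    # unrelated document
    [0.7, 0.7],    # moderately similar to documents 0 and 2
]

selected = select_diverse(toy_embeddings)
print(selected)
```

In this toy run the near-duplicate (document 1) is dropped while the other three are kept. A pairwise greedy pass like this is quadratic in the number of documents, so a corpus-scale pipeline would presumably need clustering or approximate nearest-neighbor search instead.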
Pages: 13