D4: Improving LLM Pretraining via Document De-Duplication and Diversification

Cited by: 0
Authors
Tirumala, Kushal [1]
Simig, Daniel [1]
Aghajanyan, Armen [1]
Morcos, Ari S. [1]
Affiliation
[1] Meta AI Res, New York, NY 10031 USA
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Over recent years, an increasing amount of compute and data has been poured into training large language models (LLMs), usually by doing one-pass learning on as many tokens as possible randomly selected from large-scale web corpora. While training on ever-larger portions of the internet leads to consistent performance improvements, the size of these improvements diminishes with scale, and there has been little work exploring the effect of data selection on pre-training and downstream performance beyond simple de-duplication methods such as MinHash. Here, we show that careful data selection (on top of de-duplicated data) via pre-trained model embeddings can speed up training (20% efficiency gains) and improve average downstream accuracy on 16 NLP tasks (up to 2%) at the 6.7B model scale. Furthermore, we show that repeating data intelligently consistently outperforms baseline training (while repeating random data performs worse than baseline training). Our results indicate that clever data selection can significantly improve LLM pre-training, call into question the common practice of training for a single epoch on as much data as possible, and demonstrate a path to keep improving our models past the limits of randomly sampling web data.
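The abstract does not spell out the selection procedure, so the following is only a minimal sketch of one generic way document embeddings could drive data selection: a greedy filter that keeps a document only if it is not too similar to any document already kept. It is not the paper's method. The function names `cosine` and `select_diverse`, the `threshold` parameter, and the toy two-dimensional vectors are all assumptions made for illustration; in practice the vectors would come from a pre-trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_diverse(embeddings, threshold=0.9):
    """Hypothetical greedy filter: keep document i only if its similarity to
    every previously kept document is below `threshold`.
    Returns the indices of the kept documents, in input order."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy embeddings (assumed values, not real model outputs).
# Documents 0 and 1 point in nearly the same direction.
toy = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.7, 0.7]]
kept = select_diverse(toy, threshold=0.9)
print(kept)
```

With these toy vectors, document 1 is dropped as a near-duplicate of document 0. The greedy pass compares every candidate with every kept document, which is quadratic in the worst case; a large corpus would need clustering or approximate nearest-neighbor search instead.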
Pages: 13