D4: Improving LLM Pretraining via Document De-Duplication and Diversification

Cited by: 0
Authors
Tirumala, Kushal [1 ]
Simig, Daniel [1 ]
Aghajanyan, Armen [1 ]
Morcos, Ari S. [1 ]
Affiliations
[1] Meta AI Research, New York, NY 10031 USA
Keywords
DOI
Not available
CLC classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Over recent years, an increasing amount of compute and data has been poured into training large language models (LLMs), usually by doing one-pass learning on as many tokens as possible, randomly selected from large-scale web corpora. While training on ever-larger portions of the internet leads to consistent performance improvements, the size of these improvements diminishes with scale, and there has been little work exploring the effect of data selection on pre-training and downstream performance beyond simple de-duplication methods such as MinHash. Here, we show that careful data selection (on top of de-duplicated data) via pre-trained model embeddings can speed up training (20% efficiency gains) and improve average downstream accuracy on 16 NLP tasks (up to 2%) at the 6.7B model scale. Furthermore, we show that intelligently repeating data consistently outperforms baseline training, whereas repeating randomly selected data performs worse than baseline training. Our results indicate that clever data selection can significantly improve LLM pre-training, call into question the common practice of training for a single epoch on as much data as possible, and demonstrate a path to keep improving our models beyond the limits of randomly sampling web data.
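The abstract describes selecting pre-training documents with pre-trained model embeddings on top of MinHash-style de-duplication. Below is a minimal, hypothetical Python sketch of one way such embedding-based de-duplication and diversification could look (k-means clustering of document embeddings, a cosine-similarity duplicate threshold, and pruning of the most prototypical near-centroid points). The function names, thresholds, and clustering setup are illustrative assumptions, not the authors' released implementation.

    # Hypothetical sketch of embedding-based de-duplication + diversification.
    # All names and thresholds are illustrative assumptions.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import normalize

    def select_documents(embeddings, keep_fraction=0.8, n_clusters=100, dup_threshold=0.95):
        """Return indices of documents kept after de-duplication and diversification.

        embeddings: (n_docs, dim) array of pre-trained model embeddings.
        """
        X = normalize(embeddings)  # unit-norm rows so dot products are cosine similarities
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
        keep = []
        for c in range(n_clusters):
            idx = np.where(km.labels_ == c)[0]
            if len(idx) == 0:
                continue
            # 1) De-duplicate: within a cluster, drop items whose cosine similarity
            #    to an already-kept item exceeds dup_threshold.
            kept_in_cluster = []
            for i in idx:
                if all(X[i] @ X[j] < dup_threshold for j in kept_in_cluster):
                    kept_in_cluster.append(i)
            # 2) Diversify: prefer items far from the cluster centroid, so the most
            #    prototypical (near-centroid) points are pruned first.
            centroid = km.cluster_centers_[c] / np.linalg.norm(km.cluster_centers_[c])
            dists = [1.0 - X[i] @ centroid for i in kept_in_cluster]
            order = np.argsort(dists)[::-1]  # most distant first
            n_keep = max(1, int(keep_fraction * len(kept_in_cluster)))
            keep.extend(np.asarray(kept_in_cluster)[order[:n_keep]])
        return np.sort(np.asarray(keep))

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        fake_embeddings = rng.normal(size=(1000, 64))  # stand-in for real LLM embeddings
        selected = select_documents(fake_embeddings, n_clusters=20)
        print(f"kept {len(selected)} of 1000 documents")

In practice the selection would run over embeddings of the full de-duplicated corpus, and the kept fraction and duplicate threshold would be tuned against downstream accuracy rather than fixed as above.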
Pages: 13