Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models

Cited by: 0
Authors
Wang, Ruida [1 ]
Zhou, Wangchunshu [2 ]
Sachan, Mrinmaya [3 ]
Affiliations
[1] HKUST, Hong Kong, Peoples R China
[2] AIWaves Inc, Cardiff, Wales
[3] Swiss Fed Inst Technol, Zurich, Switzerland
Funding
Swiss National Science Foundation;
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Data synthesis is a promising way to train a small model with very little labeled data. One approach is to leverage the rich knowledge of large language models to synthesize pseudo training examples for small models, making it possible to achieve both data and compute efficiency at the same time. However, a key challenge in data synthesis is that the synthesized dataset often suffers from a large distributional discrepancy from the real task data. Thus, in this paper, we propose Synthesis Step by Step (S3), a data synthesis framework that shrinks this distribution gap by using a large language model to iteratively extrapolate the errors that a small model, trained on the synthesized dataset, makes on a small real-world validation set. Extensive experiments on multiple NLP tasks show that our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data, yielding significant gains over several baselines: 9.48% over ZeroGen, 2.73% over GoldGen, and 15.17% over the small model trained on human-annotated data.
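Since this record contains only the abstract, the following is a minimal sketch of the iterative loop it describes, not the authors' released implementation. The three callables (synthesize, extrapolate, train) are assumed placeholders standing in for an LLM seed-synthesis routine, an LLM error-extrapolation prompt, and a small-model trainer, respectively.

```python
# Hypothetical sketch of the S3 loop from the abstract; all callables
# below are assumptions, not the authors' code or a real library API.
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input text, label)


def synthesis_step_by_step(
    synthesize: Callable[[], List[Example]],                 # LLM zero-shot seed synthesis
    extrapolate: Callable[[List[Example]], List[Example]],   # LLM error extrapolation
    train: Callable[[List[Example]], Callable[[str], str]],  # returns a predict function
    val_set: List[Example],                                  # small real validation set
    rounds: int = 3,
) -> Callable[[str], str]:
    # 1. Seed round: the LLM synthesizes an initial pseudo-labeled dataset.
    synthetic = synthesize()

    for _ in range(rounds):
        # 2. Train the small task model on the current synthetic dataset.
        predict = train(synthetic)

        # 3. Collect real validation examples the small model misclassifies;
        #    these mark where the synthetic distribution misses the real one.
        errors = [(x, y) for x, y in val_set if predict(x) != y]
        if not errors:
            break  # no remaining distribution gap to extrapolate from

        # 4. The LLM extrapolates new examples from the error cases,
        #    filling the under-covered regions of the task distribution.
        synthetic += extrapolate(errors)

    # 5. Final small model trained on the gap-reduced synthetic dataset.
    return train(synthetic)
```

The design point the abstract emphasizes is step 3: the small model's validation errors localize the distribution gap, so each round adds synthetic data only where the current dataset falls short of the real task.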
Pages: 11817-11831
Page count: 15
Related Papers (43 total)
  • [1] Multi-step Iterative Automated Domain Modeling with Large Language Models
    Yang, Yujing
    Chen, Boqi
    Chen, Kua
    Mussbacher, Gunter
    Varro, Daniel
    ACM/IEEE 27TH INTERNATIONAL CONFERENCE ON MODEL DRIVEN ENGINEERING LANGUAGES AND SYSTEMS: COMPANION PROCEEDINGS, MODELS 2024, 2024, : 587 - 595
  • [2] Instruct Large Language Models to Generate Scientific Literature Survey Step by Step
    Lai, Yuxuan
    Wu, Yupeng
    Wang, Yidan
    Hu, Wenpeng
    Zheng, Chen
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT V, NLPCC 2024, 2025, 15363 : 484 - 496
  • [3] The Impact of Reasoning Step Length on Large Language Models
    Jin, Mingyu
    Yu, Qinkai
Shu, Dong
    Zhao, Haiyan
    Hua, Wenyue
    Meng, Yanda
    Zhang, Yongfeng
    Du, Mengnan
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 1830 - 1842
  • [4] Bridging the gap: a practical step-by-step approach to warrant safe implementation of large language models in healthcare
    Workum, Jessica D.
    van de Sande, Davy
    Gommers, Diederik
    van Genderen, Michel E.
    FRONTIERS IN ARTIFICIAL INTELLIGENCE, 2025, 8
  • [5] Making Large Language Models Better Reasoners with Step-Aware Verifier
    Li, Yifei
    Lin, Zeqi
    Zhang, Shizhuo
    Fu, Qiang
    Chen, Bei
    Lou, Jian-Guang
    Chen, Weizhu
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 5315 - 5333
  • [6] ProgGen: Generating Named Entity Recognition Datasets Step-by-step with Self-Reflexive Large Language Models
    Heng, Yuzhao
    Deng, Chunyuan
    Li, Yitong
    Yu, Yue
    Li, Yinghao
    Zhang, Rongzhi
    Zhang, Chao
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 15992 - 16030
  • [7] A Step Towards Verification and Synthesis from Simulink/Stateflow Models
    Manamcheri, Karthik
    Mitra, Sayan
    Bak, Stanley
    Caccamo, Marco
    HSCC 11: PROCEEDINGS OF THE 14TH INTERNATIONAL CONFERENCE ON HYBRID SYSTEMS: COMPUTATION AND CONTROL, 2011, : 317 - 318
  • [8] Next-Step Hint Generation for Introductory Programming Using Large Language Models
    Roest, Lianne
    Keuning, Hieke
    Jeuring, Johan
    PROCEEDINGS OF THE 26TH AUSTRALASIAN COMPUTING EDUCATION CONFERENCE, ACE 2024, 2024, : 144 - 153
  • [9] The first step is the hardest: pitfalls of representing and tokenizing temporal data for large language models
    Spathis, Dimitris
    Kawsar, Fahim
    JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2024, 31 (09) : 2151 - 2158
  • [10] INFORM : Information eNtropy based multi-step reasoning FOR large language Models
    Zhou, Chuyue
    You, Wangjie
    Li, Juntao
    Ye, Jing
    Chen, Kehai
    Zhang, Min
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 3565 - 3576