Smaller Language Models are capable of selecting Instruction-Tuning Training Data for Larger Language Models

Cited by: 0
Authors
Mekala, Dheeraj [1 ]
Nguyen, Alex [1 ]
Shang, Jingbo [1 ,2 ]
Affiliations
[1] Univ Calif San Diego, Dept Comp Sci & Engn, La Jolla, CA 92093 USA
[2] Univ Calif San Diego, Halicioglu Data Sci Inst, La Jolla, CA 92093 USA
Keywords:
DOI: Not available
CLC number:
Subject classification number:
Abstract
Instruction-tuning language models has become a crucial step in aligning them for general use. Typically, this process involves extensive training on large datasets, incurring high training costs. In this paper, we introduce a novel training data selection method based on the learning percentage of the samples. We assert that current language models possess the capability to autonomously select high-quality training data, leading to performance comparable to or better than training on the entire dataset. Our experiments span different-sized models, revealing that this characteristic holds for models ranging from 1B (small) to 13B (large) in size. Moreover, we demonstrate an interesting finding that data hardness transfers across model sizes, and a smaller 350M model can effectively curate high-quality training data with hard samples for a larger 13B model, resulting in an equally good or superior instruction-tuned model compared to training on the complete dataset. Utilizing open-sourced OPT and Llama-2 models up to 13B in size, two publicly available instruction-tuning training datasets, and evaluation by both automatic metrics and humans, our paper introduces a novel approach to training data selection, showcasing a more efficient alternative.
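The record gives only the abstract, so the following is a minimal sketch of the selection idea rather than the paper's actual method: it assumes "learning percentage" is measured as the fraction of a sample's total loss reduction that a small reference model achieves by an early checkpoint, and that "hard" samples (low learning percentage) are the ones kept for fine-tuning the larger model. All function names, the keep_fraction parameter, and the toy loss values are illustrative assumptions, not taken from the paper.

```python
from typing import List, Sequence

def learning_percentage(initial: float, early: float, final: float) -> float:
    """Fraction of the total loss drop (initial -> final) already achieved
    by the early checkpoint. Low values mean the sample is learned late,
    i.e. it is 'hard' under this assumed definition."""
    total_drop = initial - final
    if total_drop <= 0:  # sample never improved; treat it as maximally hard
        return 0.0
    return (initial - early) / total_drop

def select_hard_samples(
    initial_losses: Sequence[float],
    early_losses: Sequence[float],
    final_losses: Sequence[float],
    keep_fraction: float = 0.3,
) -> List[int]:
    """Rank samples by learning percentage computed from a small reference
    model's per-sample losses, and keep the hardest keep_fraction of them
    as the fine-tuning set for a larger model."""
    scores = [
        (learning_percentage(i, e, f), idx)
        for idx, (i, e, f) in enumerate(
            zip(initial_losses, early_losses, final_losses)
        )
    ]
    scores.sort()  # lowest learning percentage (hardest) first
    k = max(1, int(len(scores) * keep_fraction))
    return [idx for _, idx in scores[:k]]

# Toy example: per-sample losses from a hypothetical small (e.g. 350M) model.
initial = [3.2, 2.9, 3.5, 3.1]   # before any fine-tuning
early   = [1.1, 2.7, 1.4, 2.9]   # after an early checkpoint
final   = [0.9, 1.0, 1.2, 1.1]   # after full fine-tuning
print(select_hard_samples(initial, early, final, keep_fraction=0.5))  # -> [3, 1]
```

The checkpoint choice and keep_fraction threshold are hyperparameters the paper would pin down; the sketch only shows the ranking-and-filtering shape of selecting hard samples with a smaller model.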
Pages: 10456-10470
Page count: 15