Smaller Language Models are capable of selecting Instruction-Tuning Training Data for Larger Language Models

Cited by: 0
Authors
Mekala, Dheeraj [1 ]
Nguyen, Alex [1 ]
Shang, Jingbo [1 ,2 ]
Affiliations
[1] Univ Calif San Diego, Dept Comp Sci & Engn, La Jolla, CA 92093 USA
[2] Univ Calif San Diego, Halicioglu Data Sci Inst, La Jolla, CA 92093 USA
Keywords:
DOI: Not available
CLC number:
Subject classification number:
Abstract
Instruction-tuning language models has become a crucial step in aligning them for general use. Typically, this process involves extensive training on large datasets, incurring high training costs. In this paper, we introduce a novel training data selection method based on the learning percentage of the samples. We assert that current language models possess the capability to autonomously select high-quality training data, leading to performance comparable to or better than training on the entire dataset. Our experiments span different-sized models, revealing that this characteristic holds for models ranging from 1B (small) to 13B (large) in size. Moreover, we demonstrate an interesting finding that data hardness transfers across model sizes, and a smaller 350M model can effectively curate high-quality training data with hard samples for a larger 13B model, resulting in an equally good or superior instruction-tuned model compared to training on the complete dataset. Utilizing open-sourced OPT and Llama-2 models up to 13B in size, two publicly available instruction-tuning training datasets, and evaluation by both automatic metrics and humans, our paper introduces a novel approach to training data selection, showcasing a more efficient alternative.
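The record gives only the abstract, so the following is a minimal sketch of the selection idea rather than the paper's actual method: it assumes "learning percentage" is measured as the fraction of a sample's total loss reduction that a small reference model achieves by an early checkpoint, and that "hard" samples (low learning percentage) are the ones kept for fine-tuning the larger model. All function names, the keep_fraction parameter, and the toy loss values are illustrative assumptions, not taken from the paper.

```python
from typing import List, Sequence

def learning_percentage(initial: float, early: float, final: float) -> float:
    """Fraction of the total loss drop (initial -> final) already achieved
    by the early checkpoint. Low values mean the sample is learned late,
    i.e. it is 'hard' under this assumed definition."""
    total_drop = initial - final
    if total_drop <= 0:  # sample never improved; treat it as maximally hard
        return 0.0
    return (initial - early) / total_drop

def select_hard_samples(
    initial_losses: Sequence[float],
    early_losses: Sequence[float],
    final_losses: Sequence[float],
    keep_fraction: float = 0.3,
) -> List[int]:
    """Rank samples by learning percentage computed from a small reference
    model's per-sample losses, and keep the hardest keep_fraction of them
    as the fine-tuning set for a larger model."""
    scores = [
        (learning_percentage(i, e, f), idx)
        for idx, (i, e, f) in enumerate(
            zip(initial_losses, early_losses, final_losses)
        )
    ]
    scores.sort()  # lowest learning percentage (hardest) first
    k = max(1, int(len(scores) * keep_fraction))
    return [idx for _, idx in scores[:k]]

# Toy example: per-sample losses from a hypothetical small (e.g. 350M) model.
initial = [3.2, 2.9, 3.5, 3.1]   # before any fine-tuning
early   = [1.1, 2.7, 1.4, 2.9]   # after an early checkpoint
final   = [0.9, 1.0, 1.2, 1.1]   # after full fine-tuning
print(select_hard_samples(initial, early, final, keep_fraction=0.5))  # -> [3, 1]
```

The checkpoint choice and keep_fraction threshold are hyperparameters the paper would pin down; the sketch only shows the ranking-and-filtering shape of selecting hard samples with a smaller model.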
Pages: 10456-10470
Page count: 15