Sparse Mixture of Experts Language Models Excel in Knowledge Distillation

Cited by: 0
Authors
Xu, Haiyang [1 ]
Liu, Haoxiang [2 ]
Gong, Wei [1 ]
Wang, Hai [3 ]
Deng, Xianjun [4 ]
Affiliations
[1] Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
[3] Huazhong Univ Sci & Technol, Sch Cyber Sci & Engn, Wuhan, Peoples R China
[4] Southeast Univ, Sch Comp Sci & Engn, Nanjing, Peoples R China
Keywords
Mixture of experts; Knowledge distillation; Language models
DOI
10.1007/978-981-97-9437-9_7
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Knowledge distillation is an effective method for reducing the computational overhead of large language models. However, recent optimization efforts in distilling large language models have primarily focused on loss functions and training methodologies, with limited attention given to structural improvements of student models. This is largely due to the challenges posed by cross-architecture distillation and the substantial computational resources required for modifying model structures. To address these issues, we introduce a novel method that integrates a sparse mixture of experts (MoE) architecture with low-rank adaptation (LoRA). This combination not only bolsters the capabilities of the student model but also enables MoE-based knowledge distillation without requiring continued pretraining. Experimental results indicate that our approach enhances the model's capabilities compared to dense-model distillation, achieving superior performance across a wide range of tasks. We will release our code at https://github.com/sprogxhy/MoE-KD-release.git.
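This record does not include implementation details, so the following is a minimal sketch, in PyTorch, of how a sparse mixture of LoRA experts over a frozen base projection and a standard temperature-scaled distillation loss could fit together. The class names (LoRAExpert, SparseMoELoRALayer, distillation_loss), the top-k routing scheme, and all hyperparameters are illustrative assumptions, not the authors' released code (which is at the repository linked in the abstract).

    # Minimal sketch, not the authors' released implementation: one plausible way
    # to combine a frozen base projection with top-k-routed LoRA experts and a
    # temperature-scaled knowledge-distillation loss. All names and hyperparameters
    # (rank, num_experts, top_k, T, beta) are illustrative assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LoRAExpert(nn.Module):
        """Low-rank adapter: x -> B(A(x)) * scale, added to the frozen base output."""
        def __init__(self, d_model: int, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.A = nn.Linear(d_model, rank, bias=False)
            self.B = nn.Linear(rank, d_model, bias=False)
            nn.init.zeros_(self.B.weight)  # adapter starts as a no-op
            self.scale = alpha / rank

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.B(self.A(x)) * self.scale

    class SparseMoELoRALayer(nn.Module):
        """Frozen base linear plus a sparse mixture of LoRA experts (top-k routing)."""
        def __init__(self, d_model: int, num_experts: int = 8, top_k: int = 2):
            super().__init__()
            self.base = nn.Linear(d_model, d_model)
            self.base.weight.requires_grad_(False)  # pretrained base weights stay frozen
            self.base.bias.requires_grad_(False)
            self.router = nn.Linear(d_model, num_experts, bias=False)
            self.experts = nn.ModuleList([LoRAExpert(d_model) for _ in range(num_experts)])
            self.top_k = top_k

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            tokens = x.reshape(-1, x.size(-1))                    # (tokens, d_model)
            gate_logits = self.router(tokens)                     # (tokens, num_experts)
            weights, idx = gate_logits.topk(self.top_k, dim=-1)   # (tokens, top_k)
            weights = weights.softmax(dim=-1)

            base_out = self.base(tokens)
            expert_out = torch.zeros_like(base_out)
            for e, expert in enumerate(self.experts):
                slot_mask = idx == e                              # routing hits for expert e
                token_mask = slot_mask.any(dim=-1)
                if token_mask.any():
                    gate = (weights * slot_mask).sum(dim=-1, keepdim=True)
                    expert_out[token_mask] += gate[token_mask] * expert(tokens[token_mask])
            return (base_out + expert_out).reshape_as(x)

    def distillation_loss(student_logits, teacher_logits, labels,
                          T: float = 2.0, beta: float = 0.5):
        """Blend temperature-scaled KL against the teacher with the usual CE loss."""
        s = student_logits.reshape(-1, student_logits.size(-1))
        t = teacher_logits.reshape(-1, teacher_logits.size(-1))
        kd = F.kl_div(F.log_softmax(s / T, dim=-1),
                      F.softmax(t / T, dim=-1),
                      reduction="batchmean") * (T * T)
        ce = F.cross_entropy(s, labels.reshape(-1))
        return beta * kd + (1.0 - beta) * ce

Under these assumptions, only the router and the LoRA adapters are trained while the base projection stays frozen, which is one way a student could gain sparse-MoE capacity during distillation without continued pretraining, the property the abstract emphasizes.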
Pages: 80-91
Number of pages: 12