Sparse Mixture of Experts Language Models Excel in Knowledge Distillation

Cited: 0
Authors
Xu, Haiyang [1 ]
Liu, Haoxiang [2 ]
Gong, Wei [1 ]
Wang, Hai [3 ]
Deng, Xianjun [4 ]
Affiliations
[1] Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
[3] Huazhong Univ Sci & Technol, Sch Cyber Sci & Engn, Wuhan, Peoples R China
[4] Southeast Univ, Sch Comp Sci & Engn, Nanjing, Peoples R China
Keywords
Mixture of experts; Knowledge distillation; Language models
DOI
10.1007/978-981-97-9437-9_7
CLC Number
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Knowledge distillation is an effective method for reducing the computational overhead of large language models. However, recent optimization efforts in distilling large language models have primarily focused on loss functions and training methodologies, with limited attention given to structural improvements of student models. This is largely due to the challenges posed by cross-architecture distillation and the substantial computational resources required for modifying model structures. To address these issues, we introduce a novel method that integrates a sparse mixture of experts (MoE) architecture with low-rank adaptation (LoRA). This combination not only bolsters the capabilities of the student model but also facilitates knowledge distillation using MoE without the necessity of continued pretraining. Experimental results indicate that our approach enhances the model's capabilities compared to dense model distillation, achieving superior performance across a multitude of tasks. We will release our code at https://github.com/sprogxhy/MoE-KD-release.git.
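The sketch below illustrates one plausible reading of the architecture the abstract describes: a student feed-forward block built as a sparse mixture of LoRA experts on top of a frozen dense weight, trained with a standard soft/hard knowledge-distillation loss. It is not the authors' released code; the module names (LoRAExpert, MoELoRALayer, distill_step), the top-k routing scheme, and all hyperparameters are illustrative assumptions.

# Illustrative sketch only; names and hyperparameters are assumptions, not the paper's released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAExpert(nn.Module):
    """One expert = a low-rank update (B @ A) applied on top of a shared frozen weight."""

    def __init__(self, d_model: int, d_ff: int, rank: int = 8):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, d_model) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_ff, rank))  # zero-init: experts start as a no-op update

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.A.t() @ self.B.t()


class MoELoRALayer(nn.Module):
    """Sparse MoE feed-forward: frozen dense path plus top-k routed LoRA experts."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.shared = nn.Linear(d_model, d_ff)  # stands in for the pretrained dense weight, kept frozen
        self.shared.weight.requires_grad_(False)
        self.shared.bias.requires_grad_(False)
        self.experts = nn.ModuleList(LoRAExpert(d_model, d_ff) for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = F.softmax(self.router(x), dim=-1)          # (..., n_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)      # keep only the top-k experts per token
        out = self.shared(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)   # tokens routed to expert e at slot k
                out = out + mask * weights[..., k:k + 1] * expert(x)
        return out


def distill_step(student_logits, teacher_logits, labels, T: float = 2.0, alpha: float = 0.5):
    """Standard KD objective: temperature-scaled KL to the teacher plus hard cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

In this sketch only the router and the low-rank matrices receive gradients while the shared dense weight stays frozen, which is one way an MoE student could be assembled from a dense checkpoint and distilled without continued pretraining, as the abstract claims.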
Pages: 80-91
Page count: 12
Related Papers
50 records total
  • [41] Supervised mixture of experts models for population health
    Shou, Xiao
    Mavroudeas, Georgios
    Magdon-Ismail, Malik
    Figueroa, Jose
    Kuruzovich, Jason N.
    Bennett, Kristin P.
    METHODS, 2020, 179 : 101 - 110
  • [42] Model selection for the localized mixture of experts models
    Jiang, Yunlu
    Yu Conglian
    Ji Qinghua
    JOURNAL OF APPLIED STATISTICS, 2018, 45 (11) : 1994 - 2006
  • [43] Dynamic Mixture of Experts Models for Online Prediction
    Munezero, Parfait
    Villani, Mattias
    Kohn, Robert
    TECHNOMETRICS, 2022, : 257 - 268
  • [44] SSS: Editing Factual Knowledge in Language Models towards Semantic Sparse Space
    Wang, Huazheng
    Sun, Haifeng
    Wang, Jingyu
    Qi, Qi
    Xia, Zixuan
    Zhang, Menghao
    Liao, Jianxin
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 5559 - 5570
  • [45] On Learning Mixture Models with Sparse Parameters
    Mazumdar, Arya
    Pal, Soumyabrata
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 151, 2022, 151
  • [46] ReAugKD: Retrieval-Augmented Knowledge Distillation For Pre-trained Language Models
    Zhang, Jianyi
    Muhamed, Aashiq
    Anantharaman, Aditya
    Wang, Guoyin
    Chen, Changyou
    Zhong, Kai
    Cui, Qingjun
    Xu, Yi
    Zeng, Belinda
    Chilimbi, Trishul
    Chen, Yiran
    61ST CONFERENCE OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 2, 2023, : 1128 - 1136
  • [47] DistillSeq: A Framework for Safety Alignment Testing in Large Language Models using Knowledge Distillation
    Yang, Mingke
    Chen, Yuqi
    Liu, Yi
    Shi, Ling
    PROCEEDINGS OF THE 33RD ACM SIGSOFT INTERNATIONAL SYMPOSIUM ON SOFTWARE TESTING AND ANALYSIS, ISSTA 2024, 2024, : 578 - 589
  • [48] Contextual Mixture of Experts: Integrating Knowledge into Predictive Modeling
    Souza, Francisco
    Offermans, Tim
    Barendse, Ruud
    Postma, Geert
    Jansen, Jeroen
    IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, 2023, 19 (08) : 9048 - 9059
  • [49] Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference
    Kudugunta, Sneha
    Huang, Yanping
    Bapna, Ankur
    Krikun, Maxim
    Lepikhin, Dmitry
    Thang Luong
    Firat, Orhan
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2021, 2021, : 3577 - 3599
  • [50] On the Benefits of Learning to Route in Mixture-of-Experts Models
    Dikkala, Nishanth
    Ghosh, Nikhil
    Meka, Raghu
    Panigrahy, Rina
    Vyas, Nikhil
    Wang, Xin
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023, : 9376 - 9396