Sparse Mixture of Experts Language Models Excel in Knowledge Distillation

Cited by: 0
Authors
Xu, Haiyang [1 ]
Liu, Haoxiang [2 ]
Gong, Wei [1 ]
Wang, Hai [3 ]
Deng, Xianjun [4 ]
Affiliations
[1] Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
[3] Huazhong Univ Sci & Technol, Sch Cyber Sci & Engn, Wuhan, Peoples R China
[4] Southeast Univ, Sch Comp Sci & Engn, Nanjing, Peoples R China
Keywords
Mixture of experts; Knowledge distillation; Language models
DOI
10.1007/978-981-97-9437-9_7
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Knowledge distillation is an effective method for reducing the computational overhead of large language models. However, recent work on distilling large language models has focused primarily on loss functions and training methodology, with limited attention to structural improvements of the student model, largely because of the challenges posed by cross-architecture distillation and the substantial compute required to modify model structures. To address these issues, we introduce a method that integrates a sparse mixture-of-experts (MoE) architecture with low-rank adaptation (LoRA). This combination not only strengthens the student model but also enables knowledge distillation into an MoE student without continued pretraining. Experimental results show that our approach yields a more capable student than dense-model distillation, achieving superior performance across a wide range of tasks. We will release our code at https://github.com/sprogxhy/MoE-KD-release.git.
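The abstract names the architectural idea but gives no detail: the student's experts are low-rank (LoRA) adapters routed sparsely on top of a frozen pretrained layer, which is why no continued pretraining is needed before distillation. Below is a minimal PyTorch sketch of one such layer; it is not the authors' released code (see their repository above), and the class names, top-2 routing, and rank/expert counts are illustrative assumptions.

# Minimal sketch: a sparse MoE layer whose experts are LoRA adapters over a
# frozen pretrained linear layer. NOT the authors' implementation; names,
# routing, and hyperparameters below are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAExpert(nn.Module):
    """One expert = a low-rank residual update B(A(x)), as in LoRA."""
    def __init__(self, d_in, d_out, rank=8, alpha=16.0):
        super().__init__()
        self.A = nn.Linear(d_in, rank, bias=False)    # down-projection
        self.B = nn.Linear(rank, d_out, bias=False)   # up-projection
        nn.init.zeros_(self.B.weight)                 # zero init: expert starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.B(self.A(x)) * self.scale

class MoELoRALinear(nn.Module):
    """Frozen pretrained linear layer plus sparsely routed LoRA experts.

    The base weights stay frozen and the experts start at zero, so the
    student initially equals the dense model and can be trained by
    distillation directly, without continued pretraining of the MoE."""
    def __init__(self, base: nn.Linear, n_experts=8, top_k=2, rank=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                   # only router + experts are trained
        self.experts = nn.ModuleList(
            LoRAExpert(base.in_features, base.out_features, rank)
            for _ in range(n_experts))
        self.router = nn.Linear(base.in_features, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):
        out = self.base(x)                            # frozen dense path
        flat_x = x.reshape(-1, x.shape[-1])           # (n_tokens, d_in)
        logits = self.router(flat_x)                  # (n_tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)          # renormalize over chosen experts
        delta = torch.zeros(flat_x.shape[0], out.shape[-1],
                            dtype=out.dtype, device=out.device)
        for e, expert in enumerate(self.experts):     # sparse dispatch, expert by expert
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel():
                w = weights[token_ids, slot].unsqueeze(-1)
                delta.index_add_(0, token_ids, w * expert(flat_x[token_ids]))
        return out + delta.reshape(out.shape)

In a distillation setup, a layer like this would replace, for example, the student's feed-forward projections; only the routers and LoRA experts receive gradients from the distillation loss against the teacher.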
Pages: 80-91
Page count: 12