Sparse Mixture of Experts Language Models Excel in Knowledge Distillation

Cited by: 0
Authors
Xu, Haiyang [1 ]
Liu, Haoxiang [2 ]
Gong, Wei [1 ]
Wang, Hai [3 ]
Deng, Xianjun [4 ]
Affiliations
[1] Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
[3] Huazhong Univ Sci & Technol, Sch Cyber Sci & Engn, Wuhan, Peoples R China
[4] Southeast Univ, Sch Comp Sci & Engn, Nanjing, Peoples R China
Keywords
Mixture of experts; Knowledge distillation; Language models
DOI
10.1007/978-981-97-9437-9_7
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Knowledge distillation is an effective method for reducing the computational overhead of large language models. However, recent work on distilling large language models has focused primarily on loss functions and training methodology, with limited attention to structural improvements of the student model, largely because of the challenges posed by cross-architecture distillation and the substantial compute required to modify model structures. To address these issues, we introduce a method that integrates a sparse mixture-of-experts (MoE) architecture with low-rank adaptation (LoRA). This combination not only strengthens the student model but also enables knowledge distillation into an MoE student without continued pretraining. Experimental results show that our approach yields a more capable student than dense-model distillation, achieving superior performance across a wide range of tasks. We will release our code at https://github.com/sprogxhy/MoE-KD-release.git.
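The abstract names the architectural idea but gives no detail: the student's experts are low-rank (LoRA) adapters routed sparsely on top of a frozen pretrained layer, which is why no continued pretraining is needed before distillation. Below is a minimal PyTorch sketch of one such layer; it is not the authors' released code (see their repository above), and the class names, top-2 routing, and rank/expert counts are illustrative assumptions.

# Minimal sketch: a sparse MoE layer whose experts are LoRA adapters over a
# frozen pretrained linear layer. NOT the authors' implementation; names,
# routing, and hyperparameters below are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAExpert(nn.Module):
    """One expert = a low-rank residual update B(A(x)), as in LoRA."""
    def __init__(self, d_in, d_out, rank=8, alpha=16.0):
        super().__init__()
        self.A = nn.Linear(d_in, rank, bias=False)    # down-projection
        self.B = nn.Linear(rank, d_out, bias=False)   # up-projection
        nn.init.zeros_(self.B.weight)                 # zero init: expert starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.B(self.A(x)) * self.scale

class MoELoRALinear(nn.Module):
    """Frozen pretrained linear layer plus sparsely routed LoRA experts.

    The base weights stay frozen and the experts start at zero, so the
    student initially equals the dense model and can be trained by
    distillation directly, without continued pretraining of the MoE."""
    def __init__(self, base: nn.Linear, n_experts=8, top_k=2, rank=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                   # only router + experts are trained
        self.experts = nn.ModuleList(
            LoRAExpert(base.in_features, base.out_features, rank)
            for _ in range(n_experts))
        self.router = nn.Linear(base.in_features, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):
        out = self.base(x)                            # frozen dense path
        flat_x = x.reshape(-1, x.shape[-1])           # (n_tokens, d_in)
        logits = self.router(flat_x)                  # (n_tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)          # renormalize over chosen experts
        delta = torch.zeros(flat_x.shape[0], out.shape[-1],
                            dtype=out.dtype, device=out.device)
        for e, expert in enumerate(self.experts):     # sparse dispatch, expert by expert
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel():
                w = weights[token_ids, slot].unsqueeze(-1)
                delta.index_add_(0, token_ids, w * expert(flat_x[token_ids]))
        return out + delta.reshape(out.shape)

In a distillation setup, a layer like this would replace, for example, the student's feed-forward projections; only the routers and LoRA experts receive gradients from the distillation loss against the teacher.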
Pages: 80-91
Page count: 12