Sparse Mixture of Experts Language Models Excel in Knowledge Distillation

Cited by: 0
Authors
Xu, Haiyang [1 ]
Liu, Haoxiang [2 ]
Gong, Wei [1 ]
Wang, Hai [3 ]
Deng, Xianjun [4 ]
Affiliations
[1] Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
[3] Huazhong Univ Sci & Technol, Sch Cyber Sci & Engn, Wuhan, Peoples R China
[4] Southeast Univ, Sch Comp Sci & Engn, Nanjing, Peoples R China
Keywords
Mixture of experts; Knowledge distillation; Language models
DOI
10.1007/978-981-97-9437-9_7
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Knowledge distillation is an effective method for reducing the computational overhead of large language models. However, recent optimization efforts in distilling large language models have primarily focused on loss functions and training methodologies, with limited attention given to structural improvements of student models. This is largely due to the challenges posed by cross-architecture distillation and the substantial computational resources required for modifying model structures. To address these issues, we introduce a novel method that integrates a sparse mixture of experts (MoE) architecture with low-rank adaptation (LoRA). This combination not only bolsters the capabilities of the student model but also enables MoE-based knowledge distillation without requiring continued pretraining. Experimental results indicate that our approach enhances the model's capabilities compared to dense-model distillation, achieving superior performance across a wide range of tasks. We will release our code at https://github.com/sprogxhy/MoE-KD-release.git.
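This record does not include implementation details, so the following is a minimal sketch, in PyTorch, of how a sparse mixture of LoRA experts over a frozen base projection and a standard temperature-scaled distillation loss could fit together. The class names (LoRAExpert, SparseMoELoRALayer, distillation_loss), the top-k routing scheme, and all hyperparameters are illustrative assumptions, not the authors' released code (which is at the repository linked in the abstract).

    # Minimal sketch, not the authors' released implementation: one plausible way
    # to combine a frozen base projection with top-k-routed LoRA experts and a
    # temperature-scaled knowledge-distillation loss. All names and hyperparameters
    # (rank, num_experts, top_k, T, beta) are illustrative assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LoRAExpert(nn.Module):
        """Low-rank adapter: x -> B(A(x)) * scale, added to the frozen base output."""
        def __init__(self, d_model: int, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.A = nn.Linear(d_model, rank, bias=False)
            self.B = nn.Linear(rank, d_model, bias=False)
            nn.init.zeros_(self.B.weight)  # adapter starts as a no-op
            self.scale = alpha / rank

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.B(self.A(x)) * self.scale

    class SparseMoELoRALayer(nn.Module):
        """Frozen base linear plus a sparse mixture of LoRA experts (top-k routing)."""
        def __init__(self, d_model: int, num_experts: int = 8, top_k: int = 2):
            super().__init__()
            self.base = nn.Linear(d_model, d_model)
            self.base.weight.requires_grad_(False)  # pretrained base weights stay frozen
            self.base.bias.requires_grad_(False)
            self.router = nn.Linear(d_model, num_experts, bias=False)
            self.experts = nn.ModuleList([LoRAExpert(d_model) for _ in range(num_experts)])
            self.top_k = top_k

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            tokens = x.reshape(-1, x.size(-1))                    # (tokens, d_model)
            gate_logits = self.router(tokens)                     # (tokens, num_experts)
            weights, idx = gate_logits.topk(self.top_k, dim=-1)   # (tokens, top_k)
            weights = weights.softmax(dim=-1)

            base_out = self.base(tokens)
            expert_out = torch.zeros_like(base_out)
            for e, expert in enumerate(self.experts):
                slot_mask = idx == e                              # routing hits for expert e
                token_mask = slot_mask.any(dim=-1)
                if token_mask.any():
                    gate = (weights * slot_mask).sum(dim=-1, keepdim=True)
                    expert_out[token_mask] += gate[token_mask] * expert(tokens[token_mask])
            return (base_out + expert_out).reshape_as(x)

    def distillation_loss(student_logits, teacher_logits, labels,
                          T: float = 2.0, beta: float = 0.5):
        """Blend temperature-scaled KL against the teacher with the usual CE loss."""
        s = student_logits.reshape(-1, student_logits.size(-1))
        t = teacher_logits.reshape(-1, teacher_logits.size(-1))
        kd = F.kl_div(F.log_softmax(s / T, dim=-1),
                      F.softmax(t / T, dim=-1),
                      reduction="batchmean") * (T * T)
        ce = F.cross_entropy(s, labels.reshape(-1))
        return beta * kd + (1.0 - beta) * ce

Under these assumptions, only the router and the LoRA adapters are trained while the base projection stays frozen, which is one way a student could gain sparse-MoE capacity during distillation without continued pretraining, the property the abstract emphasizes.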
Pages: 80-91
Number of pages: 12