Sparse Mixture of Experts Language Models Excel in Knowledge Distillation

Cited: 0
Authors
Xu, Haiyang [1 ]
Liu, Haoxiang [2 ]
Gong, Wei [1 ]
Wang, Hai [3 ]
Deng, Xianjun [4 ]
Affiliations
[1] Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
[3] Huazhong Univ Sci & Technol, Sch Cyber Sci & Engn, Wuhan, Peoples R China
[4] Southeast Univ, Sch Comp Sci & Engn, Nanjing, Peoples R China
Keywords
Mixture of experts; Knowledge distillation; Language models
DOI
10.1007/978-981-97-9437-9_7
CLC Number
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Knowledge distillation is an effective method for reducing the computational overhead of large language models. However, recent optimization efforts in distilling large language models have primarily focused on loss functions and training methodologies, with limited attention given to structural improvements of student models. This is largely due to the challenges posed by cross-architecture distillation and the substantial computational resources required for modifying model structures. To address these issues, we introduce a novel method that integrates a sparse mixture of experts (MoE) architecture with low-rank adaptation (LoRA). This combination not only bolsters the capabilities of the student model but also facilitates knowledge distillation using MoE without the necessity of continued pretraining. Experimental results indicate that our approach enhances the model's capabilities compared to dense model distillation, achieving superior performance across a multitude of tasks. We will release our code at https://github.com/sprogxhy/MoE-KD-release.git.
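The sketch below illustrates one plausible reading of the architecture the abstract describes: a student feed-forward block built as a sparse mixture of LoRA experts on top of a frozen dense weight, trained with a standard soft/hard knowledge-distillation loss. It is not the authors' released code; the module names (LoRAExpert, MoELoRALayer, distill_step), the top-k routing scheme, and all hyperparameters are illustrative assumptions.

# Illustrative sketch only; names and hyperparameters are assumptions, not the paper's released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAExpert(nn.Module):
    """One expert = a low-rank update (B @ A) applied on top of a shared frozen weight."""

    def __init__(self, d_model: int, d_ff: int, rank: int = 8):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, d_model) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_ff, rank))  # zero-init: experts start as a no-op update

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.A.t() @ self.B.t()


class MoELoRALayer(nn.Module):
    """Sparse MoE feed-forward: frozen dense path plus top-k routed LoRA experts."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.shared = nn.Linear(d_model, d_ff)  # stands in for the pretrained dense weight, kept frozen
        self.shared.weight.requires_grad_(False)
        self.shared.bias.requires_grad_(False)
        self.experts = nn.ModuleList(LoRAExpert(d_model, d_ff) for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = F.softmax(self.router(x), dim=-1)          # (..., n_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)      # keep only the top-k experts per token
        out = self.shared(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)   # tokens routed to expert e at slot k
                out = out + mask * weights[..., k:k + 1] * expert(x)
        return out


def distill_step(student_logits, teacher_logits, labels, T: float = 2.0, alpha: float = 0.5):
    """Standard KD objective: temperature-scaled KL to the teacher plus hard cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

In this sketch only the router and the low-rank matrices receive gradients while the shared dense weight stays frozen, which is one way an MoE student could be assembled from a dense checkpoint and distilled without continued pretraining, as the abstract claims.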
Pages: 80-91
Page count: 12
Related Papers
50 records total
  • [41] Supervised mixture of experts models for population health
    Shou, Xiao
    Mavroudeas, Georgios
    Magdon-Ismail, Malik
    Figueroa, Jose
    Kuruzovich, Jason N.
    Bennett, Kristin P.
    METHODS, 2020, 179 : 101 - 110
  • [42] Model selection for the localized mixture of experts models
    Jiang, Yunlu
    Yu Conglian
    Ji Qinghua
    JOURNAL OF APPLIED STATISTICS, 2018, 45 (11) : 1994 - 2006
  • [43] Dynamic Mixture of Experts Models for Online Prediction
    Munezero, Parfait
    Villani, Mattias
    Kohn, Robert
    TECHNOMETRICS, 2022, : 257 - 268
  • [44] SSS: Editing Factual Knowledge in Language Models towards Semantic Sparse Space
    Wang, Huazheng
    Sun, Haifeng
    Wang, Jingyu
    Qi, Qi
    Xia, Zixuan
    Zhang, Menghao
    Liao, Jianxin
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 5559 - 5570
  • [45] On Learning Mixture Models with Sparse Parameters
    Mazumdar, Arya
    Pal, Soumyabrata
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 151, 2022, 151
  • [46] ReAugKD: Retrieval-Augmented Knowledge Distillation For Pre-trained Language Models
    Zhang, Jianyi
    Muhamed, Aashiq
    Anantharaman, Aditya
    Wang, Guoyin
    Chen, Changyou
    Zhong, Kai
    Cui, Qingjun
    Xu, Yi
    Zeng, Belinda
    Chilimbi, Trishul
    Chen, Yiran
    61ST CONFERENCE OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 2, 2023, : 1128 - 1136
  • [47] DistillSeq: A Framework for Safety Alignment Testing in Large Language Models using Knowledge Distillation
    Yang, Mingke
    Chen, Yuqi
    Liu, Yi
    Shi, Ling
    PROCEEDINGS OF THE 33RD ACM SIGSOFT INTERNATIONAL SYMPOSIUM ON SOFTWARE TESTING AND ANALYSIS, ISSTA 2024, 2024, : 578 - 589
  • [48] Contextual Mixture of Experts: Integrating Knowledge into Predictive Modeling
    Souza, Francisco
    Offermans, Tim
    Barendse, Ruud
    Postma, Geert
    Jansen, Jeroen
    IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, 2023, 19 (08) : 9048 - 9059
  • [49] Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference
    Kudugunta, Sneha
    Huang, Yanping
    Bapna, Ankur
    Krikun, Maxim
    Lepikhin, Dmitry
    Thang Luong
    Firat, Orhan
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2021, 2021, : 3577 - 3599
  • [50] On the Benefits of Learning to Route in Mixture-of-Experts Models
    Dikkala, Nishanth
    Ghosh, Nikhil
    Meka, Raghu
    Panigrahy, Rina
    Vyas, Nikhil
    Wang, Xin
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023, : 9376 - 9396