Adaptive Gating in Mixture-of-Experts based Language Models

Cited by: 0
Authors
Li, Jiamin [1]
Su, Qiang [1]
Yang, Yitao [2]
Jiang, Yimin
Wang, Cong [1]
Xu, Hong [2]
Affiliations
[1] City Univ Hong Kong, Hong Kong, Peoples R China
[2] Chinese Univ Hong Kong, Hong Kong, Peoples R China
Keywords: (none listed)
DOI: not available
CLC classification: TP18 [Theory of Artificial Intelligence]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
Large language models such as OpenAI's ChatGPT have demonstrated exceptional language understanding capabilities across various NLP tasks. Sparsely activated mixture-of-experts (MoE) has emerged as a promising solution for scaling model size while keeping the number of computational operations constant. Existing MoE models adopt a fixed gating network in which every token is processed by the same number of experts. This contradicts the intuition that the tokens in a sequence vary in linguistic complexity and therefore warrant different computational costs, and prior research has paid little attention to the trade-off between per-token computation and model performance. This paper introduces adaptive gating in MoE, a flexible training strategy that allows tokens to be processed by a variable number of experts based on the expert probability distribution. The proposed framework preserves sparsity while improving training efficiency. Curriculum learning is additionally leveraged to further reduce training time. Extensive experiments on diverse NLP tasks show that adaptive gating reduces training time by up to 22.5% while maintaining inference quality. We also conduct a comprehensive analysis of the routing decisions and present the insights gained when adaptive gating is used.
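The abstract describes adaptive gating only at a high level. As a reading aid, below is a minimal, hypothetical PyTorch sketch of one plausible form of adaptive routing; the module name AdaptiveGatingMoE, the confidence threshold tau, and the rule of dispatching a token to its top-2 experts only when its top-1 gating probability falls below tau are illustrative assumptions, not the paper's published algorithm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveGatingMoE(nn.Module):
    """Hypothetical MoE layer with threshold-based adaptive top-1/top-2 routing (illustrative only)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, tau: float = 0.6):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # gating network
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )
        self.tau = tau  # top-1 confidence threshold (an assumed rule, not from the paper)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model) -- sequences already flattened into tokens.
        probs = F.softmax(self.router(x), dim=-1)        # (num_tokens, num_experts)
        top_p, top_idx = probs.topk(2, dim=-1)           # top-2 gating probabilities and indices
        needs_two = top_p[:, 0] < self.tau               # low-confidence tokens get a second expert

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            first = top_idx[:, 0] == e                   # tokens whose top-1 choice is expert e
            second = (top_idx[:, 1] == e) & needs_two    # tokens adding expert e as their top-2 choice
            if first.any():
                out[first] += top_p[first, 0].unsqueeze(-1) * expert(x[first])
            if second.any():
                out[second] += top_p[second, 1].unsqueeze(-1) * expert(x[second])
        return out
```

In this sketch, tau controls the trade-off the abstract refers to: lowering it routes more tokens through a single expert, which is where per-token computational savings would come from. The curriculum-learning component mentioned in the abstract is not shown.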
Pages: 3577-3587 (11 pages)
Related papers (50 in total; entries [41]-[50] shown below)
  • [41] Steered Mixture-of-Experts for Light Field Video Coding. Avramelos, Vasileios; Saenen, Ignace; Verhack, Ruben; Van Wallendael, Glenn; Lambert, Peter; Sikora, Thomas. Applications of Digital Image Processing XLI, 2018, 10752.
  • [42] Steered Mixture-of-Experts Approximation of Spherical Image Data. Verhack, Ruben; Madhu, Nilesh; Van Wallendael, Glenn; Lambert, Peter; Sikora, Thomas. 2018 26th European Signal Processing Conference (EUSIPCO), 2018: 256-260.
  • [43] A mixture-of-experts approach for gene regulatory network inference. Shao, Borong; Lavesson, Niklas; Boeva, Veselka; Shahzad, Raja Khurram. International Journal of Data Mining and Bioinformatics, 2016, 14(3): 258-275.
  • [44] TabMoE: A General Framework for Diverse Table-Based Reasoning with Mixture-of-Experts. Wu, Jie; Hou, Mengshu. Mathematics, 2024, 12(19).
  • [45] Practical and theoretical aspects of mixture-of-experts modeling: An overview. Nguyen, Hien D.; Chamroukhi, Faicel. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2018, 8(4).
  • [46] On-line learning of a mixture-of-experts neural network. Huh, N. J.; Oh, J. H.; Kang, K. Journal of Physics A: Mathematical and General, 2000, 33(48): 8663-8672.
  • [47] Exploring structure-property relationships in sparse data environments using mixture-of-experts models. Cheenady, Amith Adoor; Mukherjee, Arpan; Dongol, Ruhil; Rajan, Krishna. MRS Bulletin, 2025, 50(1): 32-43.
  • [48] MoE-SLU: Towards ASR-Robust Spoken Language Understanding via Mixture-of-Experts. Cheng, Xuxin; Zhu, Zhihong; Zhuang, Xianwei; Chen, Zhanpeng; Huang, Zhiqi; Zou, Yuexian. Findings of the Association for Computational Linguistics: ACL 2024, 2024: 14868-14879.
  • [49] MoE-SPNet: A mixture-of-experts scene parsing network. Fu, Huan; Gong, Mingming; Wang, Chaohui; Tao, Dacheng. Pattern Recognition, 2018, 84: 226-236.
  • [50] SpeechMoE2: Mixture-of-Experts Model with Improved Routing. You, Zhao; Feng, Shulin; Su, Dan; Yu, Dong. 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022: 7217-7221.