Adaptive Gating in Mixture-of-Experts based Language Models

Cited by: 0
Authors
Li, Jiamin [1 ]
Su, Qiang [1 ]
Yang, Yitao [2 ]
Jiang, Yimin
Wang, Cong [1 ]
Xu, Hong [2 ]
Affiliations
[1] City Univ Hong Kong, Hong Kong, Peoples R China
[2] Chinese Univ Hong Kong, Hong Kong, Peoples R China
Keywords
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Large language models, such as OpenAI's ChatGPT, have demonstrated exceptional language understanding capabilities across a variety of NLP tasks. Sparsely activated mixture-of-experts (MoE) has emerged as a promising way to scale model capacity while keeping the number of computational operations constant. Existing MoE models adopt a fixed gating network in which every token is computed by the same number of experts. This contradicts the intuition that tokens in a sequence vary in linguistic complexity and consequently require different computational costs. Prior research has said little about the trade-off between per-token computation and model performance. This paper introduces adaptive gating in MoE, a flexible training strategy that allows tokens to be processed by a variable number of experts based on the expert probability distribution. The proposed framework preserves sparsity while improving training efficiency. Curriculum learning is additionally leveraged to further reduce training time. Extensive experiments on diverse NLP tasks show that adaptive gating reduces training time by up to 22.5% while maintaining inference quality. Moreover, we conduct a comprehensive analysis of the routing decisions and present our insights from using adaptive gating.
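The abstract describes the core mechanism only at a high level: each token is routed to a variable number of experts depending on the gate's probability distribution. The PyTorch sketch below illustrates one plausible reading of that idea, where a token receives a second expert only when the gate's top-1 probability falls below a cutoff. The function name `adaptive_gate`, the `threshold` value, and the specific thresholding rule are illustrative assumptions, not the authors' exact criterion.

```python
import torch
import torch.nn.functional as F

def adaptive_gate(gate_logits: torch.Tensor, threshold: float = 0.5):
    """Illustrative adaptive gating (an assumption, not the paper's exact rule):
    route each token to its top-1 expert, and add a second expert only when
    the gate is uncertain, i.e. the top-1 probability is below `threshold`.

    gate_logits: (num_tokens, num_experts) raw router scores.
    Returns a list, per token, of (expert_id, weight) pairs.
    """
    probs = F.softmax(gate_logits, dim=-1)
    top2_p, top2_i = probs.topk(2, dim=-1)  # top-2 probabilities and expert ids
    assignments = []
    for p, idx in zip(top2_p, top2_i):
        if p[0] >= threshold:
            # Confident gate: top-1 routing, full weight to one expert.
            assignments.append([(idx[0].item(), 1.0)])
        else:
            # Ambiguous gate: top-2 routing, renormalize the two weights.
            w = p / p.sum()
            assignments.append([(idx[0].item(), w[0].item()),
                                (idx[1].item(), w[1].item())])
    return assignments

# Example: 4 tokens routed over 8 experts.
logits = torch.randn(4, 8)
for t, a in enumerate(adaptive_gate(logits)):
    print(f"token {t}: {a}")
```

Under this reading, raising the threshold sends more tokens down the top-2 path (more computation, potentially better quality), while lowering it pushes tokens toward top-1 routing, which is where the reported training-time savings would come from.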
Pages: 3577-3587
Page count: 11
Related Papers (50 in total; 10 listed)
  • [1] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts. Du, Nan; Huang, Yanping; Dai, Andrew M.; Tong, Simon; Lepikhin, Dmitry; Xu, Yuanzhong; Krikun, Maxim; Zhou, Yanqi; Yu, Adams Wei; Firat, Orhan; Zoph, Barret; Fedus, Liam; Bosma, Maarten; Zhou, Zongwei; Wang, Tao; Wang, Yu Emma; Webster, Kellie; Pellat, Marie; Robinson, Kevin; Meier-Hellstern, Kathleen; Duke, Toju; Dixon, Lucas; Zhang, Kun; Le, Quoc V.; Wu, Yonghui; Chen, Zhifeng; Cui, Claire. International Conference on Machine Learning, Vol. 162, 2022.
  • [2] Asymptotic properties of mixture-of-experts models. Olteanu, M.; Rynkiewicz, J. Neurocomputing, 2011, 74(9): 1444-1449.
  • [3] Adaptive mixture-of-experts models for data glove interface with multiple users. Yoon, Jong-Won; Yang, Sung-Ihk; Cho, Sung-Bae. Expert Systems with Applications, 2012, 39(5): 4898-4907.
  • [4] A mixture-of-experts framework for adaptive Kalman filtering. Chaer, W. S.; Bishop, R. H.; Ghosh, J. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 1997, 27(3): 452-464.
  • [5] Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models. Lu, Xudong; Liu, Qi; Xu, Yuhui; Zhou, Aojun; Huang, Siyuan; Zhang, Bo; Yan, Junchi; Li, Hongsheng. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Vol. 1: Long Papers, 2024: 6159-6172.
  • [6] A Universal Approximation Theorem for Mixture-of-Experts Models. Nguyen, Hien D.; Lloyd-Jones, Luke R.; McLachlan, Geoffrey J. Neural Computation, 2016, 28(12): 2585-2593.
  • [7] On the Benefits of Learning to Route in Mixture-of-Experts Models. Dikkala, Nishanth; Ghosh, Nikhil; Meka, Raghu; Panigrahy, Rina; Vyas, Nikhil; Wang, Xin. 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), 2023: 9376-9396.
  • [8] Efficient Inference Offloading for Mixture-of-Experts Large Language Models in Internet of Medical Things. Yuan, Xiaoming; Kong, Weixuan; Luo, Zhenyu; Xu, Minrui. Electronics, 2024, 13(11).
  • [9] Spatial Mixture-of-Experts. Dryden, Nikoli; Hoefler, Torsten. Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.
  • [10] Overcoming language barriers via machine translation with sparse Mixture-of-Experts fusion of large language models. Zhu, Shaolin; Jian, Dong; Xiong, Deyi. Information Processing & Management, 2025, 62(3).