Adaptive Gating in Mixture-of-Experts based Language Models

Cited by: 0
Authors
Li, Jiamin [1 ]
Su, Qiang [1 ]
Yang, Yitao [2 ]
Jiang, Yimin
Wang, Cong [1 ]
Xu, Hong [2 ]
Affiliations
[1] City Univ Hong Kong, Hong Kong, Peoples R China
[2] Chinese Univ Hong Kong, Hong Kong, Peoples R China
Keywords
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Large language models, such as OpenAI's ChatGPT, have demonstrated exceptional language understanding capabilities across a variety of NLP tasks. Sparsely activated mixture-of-experts (MoE) has emerged as a promising way to scale model capacity while keeping the number of computational operations constant. Existing MoE models adopt a fixed gating network in which every token is computed by the same number of experts. This contradicts the intuition that tokens in a sequence vary in linguistic complexity and consequently require different computational costs. Prior research has said little about the trade-off between per-token computation and model performance. This paper introduces adaptive gating in MoE, a flexible training strategy that allows tokens to be processed by a variable number of experts based on the expert probability distribution. The proposed framework preserves sparsity while improving training efficiency. Curriculum learning is additionally leveraged to further reduce training time. Extensive experiments on diverse NLP tasks show that adaptive gating reduces training time by up to 22.5% while maintaining inference quality. Moreover, we conduct a comprehensive analysis of the routing decisions and present our insights from using adaptive gating.
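The abstract describes the core mechanism only at a high level: each token is routed to a variable number of experts depending on the gate's probability distribution. The PyTorch sketch below illustrates one plausible reading of that idea, where a token receives a second expert only when the gate's top-1 probability falls below a cutoff. The function name `adaptive_gate`, the `threshold` value, and the specific thresholding rule are illustrative assumptions, not the authors' exact criterion.

```python
import torch
import torch.nn.functional as F

def adaptive_gate(gate_logits: torch.Tensor, threshold: float = 0.5):
    """Illustrative adaptive gating (an assumption, not the paper's exact rule):
    route each token to its top-1 expert, and add a second expert only when
    the gate is uncertain, i.e. the top-1 probability is below `threshold`.

    gate_logits: (num_tokens, num_experts) raw router scores.
    Returns a list, per token, of (expert_id, weight) pairs.
    """
    probs = F.softmax(gate_logits, dim=-1)
    top2_p, top2_i = probs.topk(2, dim=-1)  # top-2 probabilities and expert ids
    assignments = []
    for p, idx in zip(top2_p, top2_i):
        if p[0] >= threshold:
            # Confident gate: top-1 routing, full weight to one expert.
            assignments.append([(idx[0].item(), 1.0)])
        else:
            # Ambiguous gate: top-2 routing, renormalize the two weights.
            w = p / p.sum()
            assignments.append([(idx[0].item(), w[0].item()),
                                (idx[1].item(), w[1].item())])
    return assignments

# Example: 4 tokens routed over 8 experts.
logits = torch.randn(4, 8)
for t, a in enumerate(adaptive_gate(logits)):
    print(f"token {t}: {a}")
```

Under this reading, raising the threshold sends more tokens down the top-2 path (more computation, potentially better quality), while lowering it pushes tokens toward top-1 routing, which is where the reported training-time savings would come from.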
Pages: 3577-3587
Page count: 11
Related Papers (50 in total; 10 listed)
  • [1] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts. Du, Nan; Huang, Yanping; Dai, Andrew M.; Tong, Simon; Lepikhin, Dmitry; Xu, Yuanzhong; Krikun, Maxim; Zhou, Yanqi; Yu, Adams Wei; Firat, Orhan; Zoph, Barret; Fedus, Liam; Bosma, Maarten; Zhou, Zongwei; Wang, Tao; Wang, Yu Emma; Webster, Kellie; Pellat, Marie; Robinson, Kevin; Meier-Hellstern, Kathleen; Duke, Toju; Dixon, Lucas; Zhang, Kun; Le, Quoc V.; Wu, Yonghui; Chen, Zhifeng; Cui, Claire. International Conference on Machine Learning, Vol. 162, 2022.
  • [2] Asymptotic properties of mixture-of-experts models. Olteanu, M.; Rynkiewicz, J. Neurocomputing, 2011, 74(9): 1444-1449.
  • [3] Adaptive mixture-of-experts models for data glove interface with multiple users. Yoon, Jong-Won; Yang, Sung-Ihk; Cho, Sung-Bae. Expert Systems with Applications, 2012, 39(5): 4898-4907.
  • [4] A mixture-of-experts framework for adaptive Kalman filtering. Chaer, W. S.; Bishop, R. H.; Ghosh, J. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 1997, 27(3): 452-464.
  • [5] Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models. Lu, Xudong; Liu, Qi; Xu, Yuhui; Zhou, Aojun; Huang, Siyuan; Zhang, Bo; Yan, Junchi; Li, Hongsheng. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Vol. 1: Long Papers, 2024: 6159-6172.
  • [6] A Universal Approximation Theorem for Mixture-of-Experts Models. Nguyen, Hien D.; Lloyd-Jones, Luke R.; McLachlan, Geoffrey J. Neural Computation, 2016, 28(12): 2585-2593.
  • [7] On the Benefits of Learning to Route in Mixture-of-Experts Models. Dikkala, Nishanth; Ghosh, Nikhil; Meka, Raghu; Panigrahy, Rina; Vyas, Nikhil; Wang, Xin. 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), 2023: 9376-9396.
  • [8] Efficient Inference Offloading for Mixture-of-Experts Large Language Models in Internet of Medical Things. Yuan, Xiaoming; Kong, Weixuan; Luo, Zhenyu; Xu, Minrui. Electronics, 2024, 13(11).
  • [9] Spatial Mixture-of-Experts. Dryden, Nikoli; Hoefler, Torsten. Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.
  • [10] Overcoming language barriers via machine translation with sparse Mixture-of-Experts fusion of large language models. Zhu, Shaolin; Jian, Dong; Xiong, Deyi. Information Processing & Management, 2025, 62(3).