Prophet: Fine-grained Load Balancing for Parallel Training of Large-scale MoE Models

Cited by: 1
Authors
Wang, Wei [1 ]
Lai, Zhiquan [1 ]
Li, Shengwei [1 ]
Liu, Weijie [1 ]
Ge, Keshi [1 ]
Liu, Yujie [1 ]
Shen, Ao [1 ]
Li, Dongsheng [1 ]
Affiliations
[1] National University of Defense Technology, College of Computer, National Laboratory for Parallel and Distributed Processing (PDL), Changsha, People's Republic of China
Funding
National Natural Science Foundation of China; National Key R&D Program of China
Keywords
mixture of experts; distributed training
DOI
10.1109/CLUSTER52292.2023.00015
CLC number
TP3 [Computing technology, computer technology]
Subject classification number
0812
Abstract
Mixture of Experts (MoE) has received increasing attention as a way to scale DNN models to extremely large sizes with a negligible increase in computation. MoE models have achieved the highest accuracy in several domains. However, significant load imbalance arises across devices during MoE training, substantially reducing throughput. Previous load-balancing approaches either harm model convergence or suffer from high execution overhead. To address these issues, we present Prophet, a fine-grained load-balancing method for parallel training of large-scale MoE models, consisting of a planner and a scheduler. The Prophet planner first employs a fine-grained resource allocation method to determine the candidate expert placements, and then efficiently searches for a well-balanced expert placement without introducing additional overhead. The Prophet scheduler exploits the locality of the token distribution, scheduling the resource allocation operations with a layer-wise fine-grained strategy that hides their overhead. We conduct extensive experiments on four clusters and five representative models. The results indicate that Prophet achieves up to 2.3x speedup over state-of-the-art MoE frameworks including DeepSpeed-MoE and FasterMoE. Additionally, Prophet improves load balance by up to 12.06x compared to FasterMoE.
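The core idea the abstract describes, placing experts on devices so that per-device token load is balanced, can be illustrated with a minimal sketch. This is not the paper's actual planner (Prophet searches fine-grained resource allocations); it is a simple greedy longest-processing-time heuristic over hypothetical per-expert token counts, shown only to make the load-balancing objective concrete:

```python
def balanced_placement(expert_loads, num_devices):
    """Greedily assign experts to devices to even out token load.

    Illustrative sketch only: sorts experts by descending load and
    repeatedly places the next expert on the currently least-loaded
    device (LPT heuristic). Prophet's planner instead searches
    fine-grained resource allocations, which this merely approximates.
    """
    placement = {d: [] for d in range(num_devices)}  # device -> expert ids
    device_load = [0] * num_devices                  # running token counts

    for expert, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        d = min(range(num_devices), key=device_load.__getitem__)
        placement[d].append(expert)
        device_load[d] += load
    return placement, device_load


# Hypothetical per-expert token counts for one MoE layer.
loads = {0: 900, 1: 100, 2: 500, 3: 500}
placement, device_load = balanced_placement(loads, num_devices=2)
print(device_load)  # both devices end up with 1000 tokens
```

A real planner must also account for the cost of moving expert parameters between devices, which is why Prophet hides those resource allocation operations behind computation via its layer-wise schedule.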
Pages: 82-94
Page count: 13