Prophet: Fine-grained Load Balancing for Parallel Training of Large-scale MoE Models

Cited by: 1
Authors
Wang, Wei [1 ]
Lai, Zhiquan [1 ]
Li, Shengwei [1 ]
Liu, Weijie [1 ]
Ge, Keshi [1 ]
Liu, Yujie [1 ]
Shen, Ao [1 ]
Li, Dongsheng [1 ]
Affiliations
[1] National University of Defense Technology, College of Computer, National Laboratory for Parallel and Distributed Processing (PDL), Changsha, People's Republic of China
Funding
National Natural Science Foundation of China; National Key R&D Program of China
Keywords
mixture of experts; distributed training
DOI
10.1109/CLUSTER52292.2023.00015
CLC number
TP3 [Computing technology, computer technology]
Subject classification number
0812
Abstract
Mixture of Experts (MoE) has received increasing attention as a way to scale DNN models to extremely large sizes with a negligible increase in computation. MoE models have achieved the highest accuracy in several domains. However, significant load imbalance arises across devices during MoE training, substantially reducing throughput. Previous load-balancing approaches either harm model convergence or suffer from high execution overhead. To address these issues, we present Prophet, a fine-grained load-balancing method for parallel training of large-scale MoE models, consisting of a planner and a scheduler. The Prophet planner first employs a fine-grained resource allocation method to determine the candidate expert placements, and then efficiently searches for a well-balanced expert placement without introducing additional overhead. The Prophet scheduler exploits the locality of the token distribution, scheduling the resource allocation operations with a layer-wise fine-grained strategy that hides their overhead. We conduct extensive experiments on four clusters and five representative models. The results indicate that Prophet achieves up to 2.3x speedup over state-of-the-art MoE frameworks including DeepSpeed-MoE and FasterMoE. Additionally, Prophet improves load balance by up to 12.06x compared to FasterMoE.
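The core idea the abstract describes, placing experts on devices so that per-device token load is balanced, can be illustrated with a minimal sketch. This is not the paper's actual planner (Prophet searches fine-grained resource allocations); it is a simple greedy longest-processing-time heuristic over hypothetical per-expert token counts, shown only to make the load-balancing objective concrete:

```python
def balanced_placement(expert_loads, num_devices):
    """Greedily assign experts to devices to even out token load.

    Illustrative sketch only: sorts experts by descending load and
    repeatedly places the next expert on the currently least-loaded
    device (LPT heuristic). Prophet's planner instead searches
    fine-grained resource allocations, which this merely approximates.
    """
    placement = {d: [] for d in range(num_devices)}  # device -> expert ids
    device_load = [0] * num_devices                  # running token counts

    for expert, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        d = min(range(num_devices), key=device_load.__getitem__)
        placement[d].append(expert)
        device_load[d] += load
    return placement, device_load


# Hypothetical per-expert token counts for one MoE layer.
loads = {0: 900, 1: 100, 2: 500, 3: 500}
placement, device_load = balanced_placement(loads, num_devices=2)
print(device_load)  # both devices end up with 1000 tokens
```

A real planner must also account for the cost of moving expert parameters between devices, which is why Prophet hides those resource allocation operations behind computation via its layer-wise schedule.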
Pages: 82-94
Page count: 13