Prophet: Fine-grained Load Balancing for Parallel Training of Large-scale MoE Models

Cited by: 1
Authors
Wang, Wei [1]
Lai, Zhiquan [1]
Li, Shengwei [1]
Liu, Weijie [1]
Ge, Keshi [1]
Liu, Yujie [1]
Shen, Ao [1]
Li, Dongsheng [1]
Affiliations
[1] Natl Univ Def Technol, Coll Comp, Natl Lab Parallel & Distributed Proc PDL, Changsha, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China
Keywords
mixture of experts; distributed training
DOI
10.1109/CLUSTER52292.2023.00015
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Mixture of Experts (MoE) has received increasing attention for scaling DNN models to extra-large sizes with negligible increases in computation, and MoE models have achieved the highest accuracy in several domains. However, significant load imbalance arises across devices during the training of an MoE model, which substantially reduces throughput. Previous work on load balancing either harms model convergence or suffers from high execution overhead. To address these issues, we present Prophet, a fine-grained load balancing method for parallel training of large-scale MoE models that consists of a planner and a scheduler. The Prophet planner first employs a fine-grained resource allocation method to determine the possible expert placement scenarios, and then efficiently searches for a well-balanced expert placement that balances the load without introducing additional overhead. The Prophet scheduler exploits the locality of the token distribution to schedule the resource allocation operations, using a layer-wise fine-grained scheduling strategy to hide their overhead. We conduct extensive experiments on four clusters and five representative models. The results indicate that Prophet achieves up to 2.3x speedup over state-of-the-art MoE frameworks, including DeepSpeed-MoE and FasterMoE. Additionally, Prophet improves load balance by up to 12.06x compared with FasterMoE.
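To make the load-imbalance problem described in the abstract concrete, the following is a minimal Python sketch. It is not Prophet's planner or scheduler: the function names, the token counts, the imbalance metric (maximum device load divided by mean device load), and the greedy "heaviest expert to lightest device" heuristic are all assumptions chosen only for illustration. The sketch also ignores per-device memory limits and communication cost.

```python
from typing import Dict, List


def device_loads(tokens_per_expert: List[int],
                 placement: Dict[int, int],
                 num_devices: int) -> List[int]:
    """Sum the tokens routed to each device under a given expert placement."""
    loads = [0] * num_devices
    for expert, device in placement.items():
        loads[device] += tokens_per_expert[expert]
    return loads


def imbalance(loads: List[int]) -> float:
    """Hypothetical metric: max device load / mean device load (1.0 = balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean > 0 else 1.0


def static_placement(num_experts: int, num_devices: int) -> Dict[int, int]:
    """Baseline: contiguous blocks of experts per device, blind to token counts."""
    return {e: e * num_devices // num_experts for e in range(num_experts)}


def greedy_placement(tokens_per_expert: List[int],
                     num_devices: int) -> Dict[int, int]:
    """Illustrative heuristic: place the heaviest remaining expert on the
    currently lightest device. This is a generic greedy rule, not Prophet."""
    loads = [0] * num_devices
    placement: Dict[int, int] = {}
    order = sorted(range(len(tokens_per_expert)),
                   key=lambda e: -tokens_per_expert[e])
    for expert in order:
        device = min(range(num_devices), key=lambda d: loads[d])
        placement[expert] = device
        loads[device] += tokens_per_expert[expert]
    return placement


# Made-up, skewed token counts for 8 experts on 4 devices.
tokens = [900, 700, 100, 100, 50, 50, 60, 40]
static = device_loads(tokens, static_placement(len(tokens), 4), 4)
greedy = device_loads(tokens, greedy_placement(tokens, 4), 4)
print(static, imbalance(static))  # [1600, 200, 100, 100] 3.2
print(greedy, imbalance(greedy))  # [900, 700, 200, 200] 1.8
```

Even in this toy setting, moving whole experts cannot push the ratio below 1.8, because a single expert receives 900 of the 2000 tokens. That limit is one way to see why placement at a finer granularity than whole experts could matter.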
Pages: 82-94
Page count: 13