Prophet: Fine-grained Load Balancing for Parallel Training of Large-scale MoE Models

Times cited: 1
Authors
Wang, Wei [1]
Lai, Zhiquan [1]
Li, Shengwei [1]
Liu, Weijie [1]
Ge, Keshi [1]
Liu, Yujie [1]
Shen, Ao [1]
Li, Dongsheng [1]
Affiliation
[1] National University of Defense Technology, College of Computer, National Laboratory for Parallel and Distributed Processing (PDL), Changsha, People's Republic of China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China
Keywords
mixture of experts; distributed training
DOI
10.1109/CLUSTER52292.2023.00015
Chinese Library Classification (CLC) number
TP3 [Computing technology; computer technology]
Discipline classification code
0812
Abstract
Mixture of Experts (MoE) has received increasing attention as a way to scale DNN models to extremely large sizes with only a negligible increase in computation, and MoE models have achieved the highest accuracy in several domains. However, training an MoE model suffers from significant load imbalance across devices, which substantially reduces throughput. Previous load-balancing approaches either harm model convergence or incur high execution overhead. To address these issues, we present Prophet, a fine-grained load balancing method for parallel training of large-scale MoE models that consists of a planner and a scheduler. The Prophet planner first uses a fine-grained resource allocation method to determine the possible expert placement scenarios, and then efficiently searches for a well-balanced expert placement that balances the load without introducing additional overhead. The Prophet scheduler exploits the locality of the token distribution and schedules the resource allocation operations with a layer-wise, fine-grained strategy to hide their overhead. We conduct extensive experiments on four clusters with five representative models. The results show that Prophet achieves up to a 2.3x speedup over state-of-the-art MoE frameworks, including DeepSpeed-MoE and FasterMoE. Additionally, Prophet improves load balance by up to 12.06x compared with FasterMoE.
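The abstract attributes the throughput loss to uneven token load across devices and describes a planner that searches for a well-balanced expert placement. As a minimal sketch of that general idea only, assuming each expert lives on exactly one device and a device's load equals the number of tokens routed to its experts, the code below uses the classic greedy longest-processing-time-first heuristic (not Prophet's own planner, whose search the abstract does not specify) and reports the max-to-mean load ratio as an imbalance metric. The function names and token counts are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def imbalance_ratio(device_loads: List[int]) -> float:
    """Max device load divided by mean device load (1.0 means perfectly balanced)."""
    mean = sum(device_loads) / len(device_loads)
    return max(device_loads) / mean if mean > 0 else 1.0


def greedy_placement(
    expert_tokens: Dict[str, int], num_devices: int
) -> Tuple[Dict[str, int], List[int]]:
    """Assign each expert to one device with the longest-processing-time-first heuristic.

    Experts are visited in descending order of token count, and each one is placed
    on the device that currently has the smallest total load. This is a generic
    heuristic used for illustration, not Prophet's planner.
    """
    # Min-heap of (current_load, device_id) so the least-loaded device pops first;
    # ties are broken by the smaller device id.
    heap = [(0, d) for d in range(num_devices)]
    heapq.heapify(heap)
    placement: Dict[str, int] = {}
    loads = [0] * num_devices
    for expert, tokens in sorted(expert_tokens.items(), key=lambda kv: -kv[1]):
        load, device = heapq.heappop(heap)
        placement[expert] = device
        loads[device] = load + tokens
        heapq.heappush(heap, (loads[device], device))
    return placement, loads


if __name__ == "__main__":
    # Hypothetical per-expert token counts for one MoE layer in one iteration.
    tokens = {"e0": 700, "e1": 500, "e2": 200, "e3": 100}
    naive_loads = [700 + 500, 200 + 100]  # contiguous placement: e0, e1 on device 0
    placement, loads = greedy_placement(tokens, num_devices=2)
    print("naive:", naive_loads, imbalance_ratio(naive_loads))
    print("greedy:", placement, loads, imbalance_ratio(loads))
```

With these made-up counts, the contiguous placement gives loads of [1200, 300] and a ratio of 1.6, while the greedy placement gives [800, 700] and a ratio of about 1.07. The sketch deliberately ignores what the abstract says Prophet adds on top: fine-grained resource allocation when planning placements, and a scheduler that exploits token-distribution locality to hide the cost of moving experts.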
Pages: 82-94 (13 pages)