Prophet: Fine-grained Load Balancing for Parallel Training of Large-scale MoE Models

Cited by: 1

Authors:

Wang, Wei [1]

Lai, Zhiquan [1]

Li, Shengwei [1]

Liu, Weijie [1]

Ge, Keshi [1]

Liu, Yujie [1]

Shen, Ao [1]

Li, Dongsheng [1]

Affiliations:

[1] Natl Univ Def Technol, Coll Comp, Natl Lab Parallel & Distributed Proc PDL, Changsha, Peoples R China

Source:

2023 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING, CLUSTER | 2023

Funding:

National Natural Science Foundation of China; National Key Research and Development Program of China;

Keywords:

mixture of experts; distributed training;

DOI:

10.1109/CLUSTER52292.2023.00015

Chinese Library Classification:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

Mixture of Experts (MoE) has received increasing attention for scaling DNN models to extra-large sizes with negligible increases in computation, and MoE models have achieved the highest accuracy in several domains. However, significant load imbalance occurs across devices during the training of an MoE model, which substantially reduces throughput. Previous works on load balancing either harm model convergence or suffer from high execution overhead. To address these issues, we present Prophet, a fine-grained load balancing method for parallel training of large-scale MoE models, which consists of a planner and a scheduler. The Prophet planner first employs a fine-grained resource allocation method to determine the possible scenarios for expert placement, and then efficiently searches for a well-balanced expert placement to balance the load without introducing additional overhead. The Prophet scheduler exploits the locality of the token distribution to schedule the resource allocation operations, using a layer-wise fine-grained scheduling strategy to hide their overhead. We conduct extensive experiments on four clusters and five representative models. The results indicate that Prophet achieves up to 2.3x speedup compared to state-of-the-art MoE frameworks, including DeepSpeed-MoE and FasterMoE. Additionally, Prophet improves load balance by up to 12.06x compared to FasterMoE.
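To make the load-imbalance problem in the abstract concrete, the following is a minimal sketch and not Prophet's planner or scheduler. It assumes a hypothetical per-expert token count and a hypothetical static expert-to-device mapping, measures imbalance as the busiest device's load divided by the mean load, and then applies a generic greedy heuristic (heaviest expert to the least-loaded device). All function names and numbers are illustrative assumptions.

```python
def device_loads(tokens_per_expert, placement, num_devices):
    """Sum the tokens routed to each device under a given expert placement."""
    loads = [0] * num_devices
    for expert, tokens in enumerate(tokens_per_expert):
        loads[placement[expert]] += tokens
    return loads


def imbalance(loads):
    """Busiest device's load divided by the mean load (1.0 is perfectly balanced)."""
    return max(loads) / (sum(loads) / len(loads))


def greedy_placement(tokens_per_expert, num_devices):
    """Generic greedy heuristic (not Prophet's search):
    place experts in decreasing order of load on the currently lightest device."""
    loads = [0] * num_devices
    placement = [0] * len(tokens_per_expert)
    order = sorted(range(len(tokens_per_expert)),
                   key=lambda e: -tokens_per_expert[e])
    for expert in order:
        target = min(range(num_devices), key=lambda d: loads[d])
        placement[expert] = target
        loads[target] += tokens_per_expert[expert]
    return placement


# Hypothetical token counts for 8 experts on 4 devices (illustrative only).
tokens = [900, 100, 80, 60, 400, 300, 100, 60]
static = [0, 0, 1, 1, 2, 2, 3, 3]  # two consecutive experts per device

static_loads = device_loads(tokens, static, 4)
greedy_loads = device_loads(tokens, greedy_placement(tokens, 4), 4)

print("static:", static_loads, imbalance(static_loads))
print("greedy:", greedy_loads, imbalance(greedy_loads))
```

In this toy input a single expert receives 45% of all tokens, so any placement that moves whole experts leaves one device with at least 900 of the 2000 tokens. Re-placing whole experts therefore helps only slightly here, which illustrates why finer-grained allocation of resources to experts is of interest.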

Citation

Pages: 82-94

Page count: 13

