DLB: A Dynamic Load Balance Strategy for Distributed Training of Deep Neural Networks

Cited by: 2
Authors
Ye, Qing [1 ]
Zhou, Yuhao [1 ]
Shi, Mingjia [1 ]
Sun, Yanan [1 ]
Lv, Jiancheng [1 ]
Affiliation
[1] Sichuan Univ, Coll Comp Sci, Chengdu 610065, Peoples R China
Funding
U.S. National Science Foundation
Keywords
Local SGD; Load balance; Straggler problem; Distributed DNN training;
DOI
10.1109/TETCI.2022.3220224
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Synchronous strategies with data parallelism are widely used in the distributed training of Deep Neural Networks (DNNs), largely owing to their easy implementation and promising performance. In these strategies, workers with different computational capabilities must wait for one another because of the required gradient or weight synchronization. The high-performance workers therefore waste time waiting for the computationally weaker ones, which degrades the efficiency of the cluster. In this paper, we propose a Dynamic Load Balance (DLB) strategy for distributed DNN training to tackle this issue. Specifically, the performance of each worker is first evaluated from its behavior during the previous training epochs, and the batch size and dataset partition are then adaptively adjusted according to the workers' current performance. As a result, the waiting time among workers is eliminated and the utilization of the cluster is greatly improved. Furthermore, a theoretical analysis is provided to justify the convergence of the proposed algorithm. Extensive experiments have been conducted on the CIFAR10 and CIFAR100 benchmark datasets with four state-of-the-art DNN models. The experimental results indicate that the proposed algorithm significantly improves the utilization of the distributed cluster, keeps the load balance of distributed DNN training robust to disturbances, and can be flexibly combined with other synchronous distributed DNN training methods.
Pages: 1217 - 1227
Number of pages: 11
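
The abstract describes DLB's core step: re-estimating each worker's speed from the previous training epochs and then re-assigning the per-worker batch size and dataset partition accordingly. The sketch below is only a minimal illustration of that throughput-proportional idea under assumed details, not the authors' implementation; the function rebalance, its arguments, and the inverse-epoch-time throughput estimate are hypothetical.

    # Hypothetical sketch of a throughput-proportional rebalancing step (not the
    # authors' DLB implementation; names and the proportional rule are assumptions).

    def rebalance(epoch_times, total_batch_size, dataset_size):
        """Assign per-worker batch sizes and data shares in proportion to measured
        throughput, taken here as the inverse of the last epoch's wall-clock time."""
        throughputs = [1.0 / t for t in epoch_times]   # faster worker -> larger share
        total = sum(throughputs)
        shares = [s / total for s in throughputs]

        batch_sizes = [max(1, round(total_batch_size * s)) for s in shares]
        partition_sizes = [round(dataset_size * s) for s in shares]

        # Absorb rounding drift in the last worker so global sizes are preserved.
        batch_sizes[-1] += total_batch_size - sum(batch_sizes)
        partition_sizes[-1] += dataset_size - sum(partition_sizes)
        return batch_sizes, partition_sizes


    # Example: three workers whose previous epoch took 10 s, 20 s and 40 s.
    # The global batch (256) and the dataset (50,000 samples) split roughly 4:2:1.
    print(rebalance([10.0, 20.0, 40.0], total_batch_size=256, dataset_size=50000))

In the paper itself the adjustment is driven by the performance observed over previous epochs and is combined with synchronous (local SGD style) training; the sketch only conveys the proportional-allocation idea.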