DLB: A Dynamic Load Balance Strategy for Distributed Training of Deep Neural Networks

Cited by: 2
Authors
Ye, Qing [1 ]
Zhou, Yuhao [1 ]
Shi, Mingjia [1 ]
Sun, Yanan [1 ]
Lv, Jiancheng [1 ]
Affiliation
[1] Sichuan Univ, Coll Comp Sci, Chengdu 610065, Peoples R China
Funding
U.S. National Science Foundation
Keywords
Local SGD; Load balance; Straggler problem; Distributed DNN training;
DOI
10.1109/TETCI.2022.3220224
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Synchronous strategies with data parallelism are widely used in the distributed training of Deep Neural Networks (DNNs), largely owing to their easy implementation and promising performance. In these strategies, workers with different computational capabilities must wait for one another because of the required gradient or weight synchronization. The high-performance workers therefore waste time waiting for the computationally weaker ones, which degrades the efficiency of the cluster. In this paper, we propose a Dynamic Load Balance (DLB) strategy for distributed DNN training to tackle this issue. Specifically, the performance of each worker is first evaluated from its behavior during the previous training epochs, and the batch size and dataset partition are then adaptively adjusted according to the workers' current performance. As a result, the waiting time among workers is eliminated and the utilization of the cluster is greatly improved. Furthermore, a theoretical analysis is provided to justify the convergence of the proposed algorithm. Extensive experiments have been conducted on the CIFAR10 and CIFAR100 benchmark datasets with four state-of-the-art DNN models. The experimental results indicate that the proposed algorithm significantly improves the utilization of the distributed cluster, keeps the load balance of distributed DNN training robust to disturbances, and can be flexibly combined with other synchronous distributed DNN training methods.
Pages: 1217 - 1227
Number of pages: 11
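
The abstract describes DLB's core step: re-estimating each worker's speed from the previous training epochs and then re-assigning the per-worker batch size and dataset partition accordingly. The sketch below is only a minimal illustration of that throughput-proportional idea under assumed details, not the authors' implementation; the function rebalance, its arguments, and the inverse-epoch-time throughput estimate are hypothetical.

    # Hypothetical sketch of a throughput-proportional rebalancing step (not the
    # authors' DLB implementation; names and the proportional rule are assumptions).

    def rebalance(epoch_times, total_batch_size, dataset_size):
        """Assign per-worker batch sizes and data shares in proportion to measured
        throughput, taken here as the inverse of the last epoch's wall-clock time."""
        throughputs = [1.0 / t for t in epoch_times]   # faster worker -> larger share
        total = sum(throughputs)
        shares = [s / total for s in throughputs]

        batch_sizes = [max(1, round(total_batch_size * s)) for s in shares]
        partition_sizes = [round(dataset_size * s) for s in shares]

        # Absorb rounding drift in the last worker so global sizes are preserved.
        batch_sizes[-1] += total_batch_size - sum(batch_sizes)
        partition_sizes[-1] += dataset_size - sum(partition_sizes)
        return batch_sizes, partition_sizes


    # Example: three workers whose previous epoch took 10 s, 20 s and 40 s.
    # The global batch (256) and the dataset (50,000 samples) split roughly 4:2:1.
    print(rebalance([10.0, 20.0, 40.0], total_batch_size=256, dataset_size=50000))

In the paper itself the adjustment is driven by the performance observed over previous epochs and is combined with synchronous (local SGD style) training; the sketch only conveys the proportional-allocation idea.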