Orchestra: Adaptively Accelerating Distributed Deep Learning in Heterogeneous Environments

Cited by: 1
Authors
Du, Haizhou [1 ]
Huang, Sheng [1 ]
Xiang, Qiao [2 ]
Affiliations
[1] Shanghai Univ Elect Power, Shanghai, Peoples R China
[2] Xiamen Univ, Xiamen, Peoples R China
Keywords
Distributed Deep Learning; Local Update Adaptation; Load-Balance; Heterogeneous Environments
DOI
10.1145/3528416.3530246
CLC Classification
TP301 (Theory and Methods)
Subject Classification
081202
Abstract
Synchronized Local-SGD (stochastic gradient descent) has become a popular strategy in distributed deep learning (DML) because it effectively reduces the frequency of model communication while still ensuring global model convergence. However, it performs poorly and incurs excessive training time in heterogeneous environments due to differences in workers' performance. In particular, in data-unbalanced scenarios, these differences between workers can aggravate low resource utilization and eventually produce stragglers, which seriously hurt the whole training procedure. Existing solutions either suffer from the heterogeneity of computing resources or do not fully address environment dynamics. In this paper, we eliminate the negative impact of dynamic resource constraints in heterogeneous DML environments with Orchestra, a novel, adaptive load-balancing framework. The main idea of Orchestra is to improve resource utilization by balancing the load between worker performance and the unbalanced data volume. In addition, one of Orchestra's strongest features is adapting the number of local updates at each epoch per worker. To achieve this, we propose a distributed deep reinforcement learning-driven algorithm with which each worker dynamically determines its number of local updates and its training data volume, subject to mini-batch time cost and resource constraints at each epoch. Our design significantly improves the convergence speed of the model in DML compared with other state-of-the-art approaches.
Pages: 181 - 184
Page count: 4
Related Papers (50 total)
  • [1] Decentralized Distributed Deep Learning in Heterogeneous WAN Environments
    Hong, Rankyung
    Chandra, Abhishek
    PROCEEDINGS OF THE 2018 ACM SYMPOSIUM ON CLOUD COMPUTING (SOCC '18), 2018, : 505 - 505
  • [2] HeterPS: Distributed deep learning with reinforcement learning based scheduling in heterogeneous environments
    Liu, Ji
    Wu, Zhihua
    Feng, Danlei
    Zhang, Minxu
    Wu, Xinxuan
    Yao, Xuefeng
    Yu, Dianhai
    Ma, Yanjun
    Zhao, Feng
    Dou, Dejing
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2023, 148 : 106 - 117
  • [3] Accelerating geostatistical seismic inversion using TensorFlow: A heterogeneous distributed deep learning framework
    Liu, Mingliang
    Grana, Dario
    COMPUTERS & GEOSCIENCES, 2019, 124 : 37 - 45
  • [4] An Incremental Iterative Acceleration Architecture in Distributed Heterogeneous Environments With GPUs for Deep Learning
    Zhang, Xuedong
    Tang, Zhuo
    Du, Lifan
    Yang, Li
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2021, 32 (11) : 2823 - 2837
  • [5] Accelerating Distributed Learning in Non-Dedicated Environments
    Chen, Chen
    Weng, Qizhen
    Wang, Wei
    Li, Baochun
    Li, Bo
    IEEE TRANSACTIONS ON CLOUD COMPUTING, 2023, 11 (01) : 515 - 531
  • [6] PipeCompress: Accelerating Pipelined Communication for Distributed Deep Learning
    Liu, Juncai
    Wang, Jessie Hui
    Rong, Chenghao
    Wang, Jilong
    IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC 2022), 2022, : 207 - 212
  • [7] ACCELERATING DISTRIBUTED DEEP LEARNING BY ADAPTIVE GRADIENT QUANTIZATION
    Guo, Jinrong
    Liu, Wantao
    Wang, Wang
    Han, Jizhong
    Li, Ruixuan
    Lu, Yijun
    Hu, Songlin
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 1603 - 1607
  • [8] ASHL: An Adaptive Multi-Stage Distributed Deep Learning Training Scheme for Heterogeneous Environments
    Shen, Zhaoyan
    Tang, Qingxiang
    Zhou, Tianren
    Zhang, Yuhao
    Jia, Zhiping
    Yu, Dongxiao
    Zhang, Zhiyong
    Li, Bingzhe
    IEEE TRANSACTIONS ON COMPUTERS, 2024, 73 (01) : 30 - 43
  • [9] The Application of Deep Reinforcement Learning to Distributed Spectrum Access in Dynamic Heterogeneous Environments With Partial Observations
    Xu, Yue
    Yu, Jianyuan
    Buehrer, R. Michael
    IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, 2020, 19 (07) : 4494 - 4506
  • [10] Communication Optimization Schemes for Accelerating Distributed Deep Learning Systems
    Lee, Jaehwan
    Choi, Hyeonseong
    Jeong, Hyeonwoo
    Noh, Baekhyeon
    Shin, Ji Sun
    APPLIED SCIENCES-BASEL, 2020, 10 (24): 1 - 15