Orchestra: Adaptively Accelerating Distributed Deep Learning in Heterogeneous Environments

Cited by: 1
Authors
Du, Haizhou [1 ]
Huang, Sheng [1 ]
Xiang, Qiao [2 ]
Affiliations
[1] Shanghai Univ Elect Power, Shanghai, Peoples R China
[2] Xiamen Univ, Xiamen, Peoples R China
Keywords
Distributed Deep Learning; Local Update Adaptation; Load-Balance; Heterogeneous Environments
DOI
10.1145/3528416.3530246
CLC Classification
TP301 (Theory and Methods)
Subject Classification
081202
Abstract
Synchronized Local-SGD (stochastic gradient descent) has become a popular strategy in distributed deep learning (DML) because it effectively reduces the frequency of model communication while still ensuring global model convergence. However, it performs poorly and incurs excessive training time in heterogeneous environments due to differences in workers' performance. In particular, in data-unbalanced scenarios, these differences between workers can aggravate low resource utilization and eventually produce stragglers, which seriously hurt the whole training procedure. Existing solutions either suffer from the heterogeneity of computing resources or do not fully address environment dynamics. In this paper, we eliminate the negative impact of dynamic resource constraints in heterogeneous DML environments with Orchestra, a novel, adaptive load-balancing framework. The main idea of Orchestra is to improve resource utilization by balancing the load between worker performance and the unbalanced data volume. In addition, one of Orchestra's strongest features is adapting the number of local updates at each epoch per worker. To achieve this, we propose a distributed deep reinforcement learning-driven algorithm with which each worker dynamically determines its number of local updates and its training data volume, subject to mini-batch time cost and resource constraints at each epoch. Our design significantly improves the convergence speed of the model in DML compared with other state-of-the-art approaches.
Pages: 181 - 184
Page count: 4
Related Papers (50 total)
  • [1] Decentralized Distributed Deep Learning in Heterogeneous WAN Environments
    Hong, Rankyung
    Chandra, Abhishek
    PROCEEDINGS OF THE 2018 ACM SYMPOSIUM ON CLOUD COMPUTING (SOCC '18), 2018, : 505 - 505
  • [2] HeterPS: Distributed deep learning with reinforcement learning based scheduling in heterogeneous environments
    Liu, Ji
    Wu, Zhihua
    Feng, Danlei
    Zhang, Minxu
    Wu, Xinxuan
    Yao, Xuefeng
    Yu, Dianhai
    Ma, Yanjun
    Zhao, Feng
    Dou, Dejing
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2023, 148 : 106 - 117
  • [3] Accelerating geostatistical seismic inversion using TensorFlow: A heterogeneous distributed deep learning framework
    Liu, Mingliang
    Grana, Dario
    COMPUTERS & GEOSCIENCES, 2019, 124 : 37 - 45
  • [4] An Incremental Iterative Acceleration Architecture in Distributed Heterogeneous Environments With GPUs for Deep Learning
    Zhang, Xuedong
    Tang, Zhuo
    Du, Lifan
    Yang, Li
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2021, 32 (11) : 2823 - 2837
  • [5] Accelerating Distributed Learning in Non-Dedicated Environments
    Chen, Chen
    Weng, Qizhen
    Wang, Wei
    Li, Baochun
    Li, Bo
    IEEE TRANSACTIONS ON CLOUD COMPUTING, 2023, 11 (01) : 515 - 531
  • [6] PipeCompress: Accelerating Pipelined Communication for Distributed Deep Learning
    Liu, Juncai
    Wang, Jessie Hui
    Rong, Chenghao
    Wang, Jilong
    IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC 2022), 2022, : 207 - 212
  • [7] ACCELERATING DISTRIBUTED DEEP LEARNING BY ADAPTIVE GRADIENT QUANTIZATION
    Guo, Jinrong
    Liu, Wantao
    Wang, Wang
    Han, Jizhong
    Li, Ruixuan
    Lu, Yijun
    Hu, Songlin
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 1603 - 1607
  • [8] ASHL: An Adaptive Multi-Stage Distributed Deep Learning Training Scheme for Heterogeneous Environments
    Shen, Zhaoyan
    Tang, Qingxiang
    Zhou, Tianren
    Zhang, Yuhao
    Jia, Zhiping
    Yu, Dongxiao
    Zhang, Zhiyong
    Li, Bingzhe
    IEEE TRANSACTIONS ON COMPUTERS, 2024, 73 (01) : 30 - 43
  • [9] The Application of Deep Reinforcement Learning to Distributed Spectrum Access in Dynamic Heterogeneous Environments With Partial Observations
    Xu, Yue
    Yu, Jianyuan
    Buehrer, R. Michael
    IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, 2020, 19 (07) : 4494 - 4506
  • [10] Communication Optimization Schemes for Accelerating Distributed Deep Learning Systems
    Lee, Jaehwan
    Choi, Hyeonseong
    Jeong, Hyeonwoo
    Noh, Baekhyeon
    Shin, Ji Sun
    APPLIED SCIENCES-BASEL, 2020, 10 (24): 1 - 15