Orchestra: Adaptively Accelerating Distributed Deep Learning in Heterogeneous Environments

Cited by: 1
Authors
Du, Haizhou [1 ]
Huang, Sheng [1 ]
Xiang, Qiao [2 ]
Affiliations
[1] Shanghai Univ Elect Power, Shanghai, Peoples R China
[2] Xiamen Univ, Xiamen, Peoples R China
Keywords
Distributed Deep Learning; Local Update Adaptation; Load-Balance; Heterogeneous Environments;
DOI
10.1145/3528416.3530246
CLC Number (Chinese Library Classification)
TP301 [Theory, Methods];
Discipline Code
081202;
Abstract
The synchronized Local-SGD (stochastic gradient descent) strategy has become increasingly popular in distributed deep learning (DML) because it effectively reduces the frequency of model communication while still ensuring global model convergence. However, it performs poorly and leads to excessive training time in heterogeneous environments due to differences in worker performance. In particular, in scenarios with unbalanced data, these differences between workers can aggravate low resource utilization and eventually produce stragglers, which seriously hurt the whole training procedure. Existing solutions either suffer from the heterogeneity of computing resources or do not fully address environment dynamics. In this paper, we eliminate the negative impact of dynamic resource constraints in heterogeneous DML environments with a novel, adaptive load-balancing framework called Orchestra. The main idea of Orchestra is to improve resource utilization by balancing the load between workers' performance and their unequal data volumes. In addition, one of Orchestra's strongest features is adapting the number of local updates at each epoch for each worker. To achieve this, we propose a distributed deep reinforcement learning-driven algorithm that lets each worker dynamically determine its number of local updates and its training data volume, subject to mini-batch time cost and resource constraints at each epoch. Our design significantly improves the convergence speed of the model in DML compared with other state-of-the-art approaches.
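The abstract describes synchronized Local-SGD in which the number of local updates per worker is adapted each round to that worker's speed. The sketch below is a minimal illustration of that general idea only, under simplifying assumptions: a toy least-squares objective, simulated per-step costs, and a simple time-budget heuristic in place of Orchestra's deep-reinforcement-learning controller. All names here (Worker, adapt_local_updates, TIME_BUDGET) are hypothetical and do not come from the paper.

```python
# Hedged sketch of per-worker local-update adaptation in synchronized Local-SGD.
# This is NOT Orchestra's actual algorithm; the adaptation rule below is a
# simple time-budget heuristic standing in for the paper's DRL-based controller.
import numpy as np

rng = np.random.default_rng(0)

DIM = 10
TRUE_W = rng.normal(size=DIM)

def make_shard(n):
    """Synthetic data shard for a toy least-squares objective ||Xw - y||^2."""
    X = rng.normal(size=(n, DIM))
    y = X @ TRUE_W + 0.01 * rng.normal(size=n)
    return X, y

class Worker:
    """One heterogeneous worker: its own data shard and per-mini-batch cost."""
    def __init__(self, n_samples, step_cost):
        self.X, self.y = make_shard(n_samples)
        self.step_cost = step_cost  # simulated seconds per mini-batch

    def local_sgd(self, w, n_updates, lr=0.01, batch=32):
        """Run n_updates local SGD steps from the global model w."""
        w = w.copy()
        for _ in range(n_updates):
            idx = rng.choice(len(self.y), size=batch)
            Xb, yb = self.X[idx], self.y[idx]
            grad = 2 * Xb.T @ (Xb @ w - yb) / batch
            w -= lr * grad
        return w, n_updates * self.step_cost  # local model, simulated wall time

def adapt_local_updates(workers, time_budget):
    """Give each worker as many local steps as fit in the round's time budget,
    so slower workers do fewer steps and no worker becomes a straggler."""
    return [max(1, int(time_budget / wk.step_cost)) for wk in workers]

# Heterogeneous cluster: unequal data volumes and unequal per-step costs.
workers = [Worker(2000, 0.01), Worker(1000, 0.03), Worker(500, 0.10)]
w_global = np.zeros(DIM)
TIME_BUDGET = 0.5  # assumed simulated seconds per synchronization round

for rnd in range(20):
    steps = adapt_local_updates(workers, TIME_BUDGET)
    results = [wk.local_sgd(w_global, s) for wk, s in zip(workers, steps)]
    # Synchronization barrier: round time is bounded by the slowest worker.
    round_time = max(t for _, t in results)
    # Simple parameter averaging (FedAvg-style) forms the new global model.
    w_global = np.mean([w for w, _ in results], axis=0)
    loss = np.mean([np.mean((wk.X @ w_global - wk.y) ** 2) for wk in workers])
    print(f"round {rnd:2d}  steps={steps}  time={round_time:.2f}s  loss={loss:.4f}")
```

Because each worker's step count is scaled to its per-step cost, all workers finish a round in roughly the same simulated time, which is the straggler-avoidance effect the abstract targets. The actual framework additionally adapts each worker's training data volume and uses a learned policy rather than this fixed heuristic; neither is modeled in this sketch.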
Pages: 181-184
Page count: 4
Related Papers
50 items total
  • [31] Accelerating deep learning with precision
    Vaughan, Owain
    NATURE ELECTRONICS, 2022, 5 (07) : 411 - 411
  • [33] Distributed learning environments
    Alavi, M
    COMPUTER, 2004, 37 (01) : 121 - 122
  • [34] Engineering heterogeneous distributed learning environments using tuple spaces as an architectural platform
    Weinbrenner, Stefan
    Giemza, Adam
    Hoppe, H. Ulrich
    7TH IEEE INTERNATIONAL CONFERENCE ON ADVANCED LEARNING TECHNOLOGIES, PROCEEDINGS, 2007, : 434 - +
  • [35] Distributed Deep Learning on Heterogeneous Computing Resources Using Gossip Communication
    Georgiev, Dobromir
    Gurov, Todor
    LARGE-SCALE SCIENTIFIC COMPUTING (LSSC 2019), 2020, 11958 : 220 - 227
  • [36] Modeling the Training Iteration Time for Heterogeneous Distributed Deep Learning Systems
    Zeng, Yifu
    Chen, Bowei
    Pan, Pulin
    Li, Kenli
    Chen, Guo
    INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, 2023, 2023
  • [37] Distributed Inference with Deep Learning Models across Heterogeneous Edge Devices
    Hu, Chenghao
    Li, Baochun
    IEEE CONFERENCE ON COMPUTER COMMUNICATIONS (IEEE INFOCOM 2022), 2022, : 330 - 339
  • [38] Scheduling Distributed Deep Learning Jobs in Heterogeneous Cluster with Placement Awareness
    Li, Qingping
    Xu, Jingwei
    Cao, Chun
    THE 12TH ASIA-PACIFIC SYMPOSIUM ON INTERNETWARE, INTERNETWARE 2020, 2021, : 217 - 228
  • [39] Heterogeneous Logical Environments for Distributed Specifications
    Mossakowski, Till
    Tarlecki, Andrzej
    RECENT TRENDS IN ALGEBRAIC DEVELOPMENT TECHNIQUES, 2009, 5486 : 266 - +
  • [40] Distributed Evolutionary Algorithms in Heterogeneous Environments
    Salto, Carolina
    Luna, Francisco
    Alba, Enrique
    2013 EIGHTH INTERNATIONAL CONFERENCE ON P2P, PARALLEL, GRID, CLOUD AND INTERNET COMPUTING (3PGCIC 2013), 2013, : 606 - 611