Live Gradient Compensation for Evading Stragglers in Distributed Learning

Cited by: 13
Authors
Xu, Jian [1 ]
Huang, Shao-Lun [1 ]
Song, Linqi [2 ]
Lan, Tian [3 ]
Affiliations
[1] Tsinghua Univ, Tsinghua Berkeley Shenzhen Inst, Beijing, Peoples R China
[2] City Univ Hong Kong, Hong Kong, Peoples R China
[3] George Washington Univ, Washington, DC 20052 USA
Funding
National Natural Science Foundation of China
Keywords
Straggler; Distributed Learning; Non-IID; Gradient Compensation; Optimization
DOI
10.1109/INFOCOM42981.2021.9488815
Chinese Library Classification
TP3 [Computing Technology, Computer Technology]
Discipline Code
0812
Abstract
The training efficiency of distributed learning systems is vulnerable to stragglers, i.e., slow worker nodes. A naive strategy is to perform the distributed learning with only the fastest K workers and ignore the stragglers, which can introduce large deviation under non-IID data. To tackle this, we develop a Live Gradient Compensation (LGC) strategy that incorporates one-step delayed gradients from stragglers, aiming to accelerate the learning process while still utilizing the stragglers. In the LGC framework, each mini-batch is divided into smaller blocks that are processed separately, so the gradient computed from a straggler's partial work remains accessible. In addition, we provide a theoretical convergence analysis of our algorithm for non-convex optimization problems under non-IID training data, showing that LGC-SGD achieves almost the same convergence error as fully synchronous SGD. The theoretical results also allow us to quantify a novel tradeoff between training time and error by selecting the optimal straggler threshold. Finally, extensive simulation experiments on image classification with the CIFAR-10 dataset are conducted, and the numerical results demonstrate the effectiveness of the proposed strategy.
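The abstract describes the mechanism only at a high level. Below is a minimal NumPy sketch of an LGC-style update loop under simulated worker timing; it is not the authors' implementation. The worker count, block count, straggler threshold K, learning rate, and synthetic non-IID linear-regression data are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: N workers, each holding a shifted (non-IID) linear-regression dataset.
N_WORKERS, K_FASTEST, DIM, BLOCKS = 8, 5, 10, 4
w = np.zeros(DIM)
data = [(rng.normal(0.1 * i, 1.0, (64, DIM)), rng.normal(0.1 * i, 1.0, 64))
        for i in range(N_WORKERS)]

def block_gradients(w, X, y, n_blocks):
    """Split the local mini-batch into blocks and return one MSE gradient per block."""
    return [2.0 * Xb.T @ (Xb @ w - yb) / len(yb)
            for Xb, yb in zip(np.array_split(X, n_blocks), np.array_split(y, n_blocks))]

delayed = {i: None for i in range(N_WORKERS)}  # one-step delayed partial gradients of stragglers
lr = 0.01
for step in range(100):
    order = rng.permutation(N_WORKERS)                 # simulated arrival order this round
    fast, stragglers = order[:K_FASTEST], order[K_FASTEST:]
    agg, count = np.zeros(DIM), 0
    for i in fast:                                     # fastest K workers contribute fresh full gradients
        X, y = data[i]
        agg += np.mean(block_gradients(w, X, y, BLOCKS), axis=0)
        count += 1
    for i in stragglers:                               # stragglers: compensate with last round's partial work
        if delayed[i] is not None:
            agg += delayed[i]
            count += 1
        X, y = data[i]
        done = rng.integers(1, BLOCKS)                 # straggler finishes only some blocks before the deadline
        delayed[i] = np.mean(block_gradients(w, X, y, BLOCKS)[:done], axis=0)
    w -= lr * agg / count                              # LGC-style step mixing fresh and one-step delayed gradients
```

The key point the sketch illustrates is that block-wise processing lets a straggler's partially computed gradient be stored and folded into the next round's update instead of being discarded.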
Pages: 10