Live Gradient Compensation for Evading Stragglers in Distributed Learning

Cited by: 13
Authors
Xu, Jian [1 ]
Huang, Shao-Lun [1 ]
Song, Linqi [2 ]
Lan, Tian [3 ]
Affiliations
[1] Tsinghua Univ, Tsinghua Berkeley Shenzhen Inst, Beijing, Peoples R China
[2] City Univ Hong Kong, Hong Kong, Peoples R China
[3] George Washington Univ, Washington, DC 20052 USA
Funding
National Natural Science Foundation of China
Keywords
Straggler; Distributed Learning; Non-IID; Gradient Compensation; Optimization
DOI
10.1109/INFOCOM42981.2021.9488815
Chinese Library Classification
TP3 [Computing Technology, Computer Technology]
Discipline Code
0812
Abstract
The training efficiency of distributed learning systems is vulnerable to stragglers, i.e., slow worker nodes. A naive strategy is to perform the distributed learning with only the fastest K workers and ignore the stragglers, which can introduce large deviation under non-IID data. To tackle this, we develop a Live Gradient Compensation (LGC) strategy that incorporates one-step delayed gradients from stragglers, aiming to accelerate the learning process while still utilizing the stragglers. In the LGC framework, each mini-batch is divided into smaller blocks that are processed separately, so the gradient computed from a straggler's partial work remains accessible. In addition, we provide a theoretical convergence analysis of our algorithm for non-convex optimization problems under non-IID training data, showing that LGC-SGD achieves almost the same convergence error as fully synchronous SGD. The theoretical results also allow us to quantify a novel tradeoff between training time and error by selecting the optimal straggler threshold. Finally, extensive simulation experiments on image classification with the CIFAR-10 dataset are conducted, and the numerical results demonstrate the effectiveness of the proposed strategy.
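The abstract describes the mechanism only at a high level. Below is a minimal NumPy sketch of an LGC-style update loop under simulated worker timing; it is not the authors' implementation. The worker count, block count, straggler threshold K, learning rate, and synthetic non-IID linear-regression data are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: N workers, each holding a shifted (non-IID) linear-regression dataset.
N_WORKERS, K_FASTEST, DIM, BLOCKS = 8, 5, 10, 4
w = np.zeros(DIM)
data = [(rng.normal(0.1 * i, 1.0, (64, DIM)), rng.normal(0.1 * i, 1.0, 64))
        for i in range(N_WORKERS)]

def block_gradients(w, X, y, n_blocks):
    """Split the local mini-batch into blocks and return one MSE gradient per block."""
    return [2.0 * Xb.T @ (Xb @ w - yb) / len(yb)
            for Xb, yb in zip(np.array_split(X, n_blocks), np.array_split(y, n_blocks))]

delayed = {i: None for i in range(N_WORKERS)}  # one-step delayed partial gradients of stragglers
lr = 0.01
for step in range(100):
    order = rng.permutation(N_WORKERS)                 # simulated arrival order this round
    fast, stragglers = order[:K_FASTEST], order[K_FASTEST:]
    agg, count = np.zeros(DIM), 0
    for i in fast:                                     # fastest K workers contribute fresh full gradients
        X, y = data[i]
        agg += np.mean(block_gradients(w, X, y, BLOCKS), axis=0)
        count += 1
    for i in stragglers:                               # stragglers: compensate with last round's partial work
        if delayed[i] is not None:
            agg += delayed[i]
            count += 1
        X, y = data[i]
        done = rng.integers(1, BLOCKS)                 # straggler finishes only some blocks before the deadline
        delayed[i] = np.mean(block_gradients(w, X, y, BLOCKS)[:done], axis=0)
    w -= lr * agg / count                              # LGC-style step mixing fresh and one-step delayed gradients
```

The key point the sketch illustrates is that block-wise processing lets a straggler's partially computed gradient be stored and folded into the next round's update instead of being discarded.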
Pages: 10