Genuinely distributed Byzantine machine learning

Cited by: 0
Authors
El-Mahdi El-Mhamdi
Rachid Guerraoui
Arsany Guirguis
Lê-Nguyên Hoang
Sébastien Rouault
Affiliations
[1] Ecole Polytechnique Fédérale de Lausanne (EPFL), School of Computer and Communication Sciences (IC)
Source
Distributed Computing | 2022 / Volume 35
Keywords
Distributed machine learning; Robust machine learning; Byzantine fault tolerance; Byzantine parameter servers;
DOI
Not available
Chinese Library Classification number
Subject classification code
Abstract
Machine learning (ML) solutions are nowadays distributed, according to the so-called server/worker architecture. One server holds the model parameters while several workers train the model. Clearly, such an architecture is prone to various types of component failures, which can all be encompassed within the spectrum of Byzantine behavior. Several approaches have been proposed recently to tolerate Byzantine workers. Yet all require trusting a central parameter server. We initiate in this paper the study of the “general” Byzantine-resilient distributed machine learning problem where no individual component is trusted. In particular, we distribute the parameter server computation on several nodes. We show that this problem can be solved in an asynchronous system, despite the presence of $\frac{1}{3}$ Byzantine parameter servers (i.e., $n_{ps} > 3f_{ps}+1$) and $\frac{1}{3}$ Byzantine workers (i.e., $n_w > 3f_w$), which is asymptotically optimal.
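The two resilience conditions stated above can be checked mechanically. The following is a minimal sketch, not code from the paper; the function name `is_resilient` and its argument names are hypothetical and simply mirror the symbols $n_{ps}$, $f_{ps}$, $n_w$ and $f_w$ of the abstract.

```python
def is_resilient(n_ps: int, f_ps: int, n_w: int, f_w: int) -> bool:
    """Check the two bounds quoted in the abstract.

    n_ps / f_ps: total and Byzantine parameter servers (n_ps > 3*f_ps + 1).
    n_w  / f_w : total and Byzantine workers           (n_w  > 3*f_w).
    Illustrative helper only; names are assumptions, not the authors' API.
    """
    if min(n_ps, f_ps, n_w, f_w) < 0:
        raise ValueError("counts must be non-negative")
    servers_ok = n_ps > 3 * f_ps + 1
    workers_ok = n_w > 3 * f_w
    return servers_ok and workers_ok


# Example deployment: 5 servers (1 Byzantine) and 4 workers (1 Byzantine).
print(is_resilient(n_ps=5, f_ps=1, n_w=4, f_w=1))
# With only 4 servers, the server bound 4 > 3*1 + 1 no longer holds.
print(is_resilient(n_ps=4, f_ps=1, n_w=4, f_w=1))
```

Under these bounds, tolerating one Byzantine server takes at least five servers, while tolerating one Byzantine worker takes at least four workers.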
We present a new algorithm, ByzSGD, which solves the general Byzantine-resilient distributed machine learning problem by relying on three major schemes. The first, scatter/gather, is a communication scheme whose goal is to bound the maximum drift among models on correct servers. The second, distributed median contraction (DMC), leverages the geometric properties of the median in high-dimensional spaces to bring parameters within the correct servers back close to each other, ensuring safe and lively learning. The third, minimum-diameter averaging (MDA), is a statistically robust gradient aggregation rule whose goal is to tolerate Byzantine workers. MDA requires only a loose bound on the variance of non-Byzantine gradient estimates, compared to existing alternatives [e.g., Krum (Blanchard et al., in: Neural information processing systems, pp. 118-128, 2017)]. Interestingly, ByzSGD ensures Byzantine resilience without adding communication rounds (on a normal path), compared to vanilla non-Byzantine alternatives. ByzSGD requires, however, a larger number of messages which, we show, can be reduced if we assume synchrony. We implemented ByzSGD on top of both TensorFlow and PyTorch, and we report on our evaluation results. In particular, we show that ByzSGD guarantees convergence with around 32% overhead compared to vanilla SGD. Furthermore, we show that ByzSGD’s throughput overhead is 24–176% in the synchronous case and 28–220% in the asynchronous case.
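To make the two aggregation ideas above concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the authors' TensorFlow or PyTorch implementation: `mda` follows the usual reading of minimum-diameter averaging (average the subset of n - f gradients whose largest pairwise distance is smallest), and `coordinate_median` assumes a coordinate-wise median, which the abstract does not specify. Both function names are hypothetical.

```python
from itertools import combinations
from math import dist, inf
from statistics import median


def mda(gradients, f):
    """Minimum-diameter averaging sketch (assumed definition, see lead-in).

    gradients: list of equal-length lists of floats, one per worker.
    f: number of gradients that may be Byzantine.
    Brute force over all subsets, so only suitable for small n.
    """
    n = len(gradients)
    if not 0 <= f < n:
        raise ValueError("need 0 <= f < number of gradients")
    keep = n - f
    best_subset, best_diameter = None, inf
    for subset in combinations(gradients, keep):
        # Diameter = largest Euclidean distance between two members.
        diameter = max(
            (dist(a, b) for a, b in combinations(subset, 2)), default=0.0
        )
        if diameter < best_diameter:
            best_subset, best_diameter = subset, diameter
    dim = len(best_subset[0])
    return [sum(g[i] for g in best_subset) / keep for i in range(dim)]


def coordinate_median(models):
    """Coordinate-wise median of several parameter vectors (an assumption)."""
    return [median(column) for column in zip(*models)]


# Three honest gradients near (1, 1) and one outlier sent by a faulty worker.
grads = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2], [100.0, -100.0]]
print(mda(grads, f=1))  # the outlier is excluded from the average

# Two nearby server models and one far-off model.
print(coordinate_median([[0.0, 0.0], [1.0, 1.0], [50.0, -50.0]]))
```

The subset search is exponential in the number of workers, which is acceptable for a sketch but would need a smarter strategy at scale.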
Pages: 305-331
Page count: 26
Related papers
50 in total
  • [31] Self-stabilizing Byzantine-Tolerant Distributed Replicated State Machine
    Binun, Alexander
    Coupaye, Thierry
    Dolev, Shlomi
    Kassi-Lahlou, Mohammed
    Lacoste, Marc
    Palesandro, Alex
    Yagel, Reuven
    Yankulin, Leonid
    STABILIZATION, SAFETY, AND SECURITY OF DISTRIBUTED SYSTEMS, SSS 2016, 2016, 10083 : 36 - 53
  • [32] Adversary-Resilient Distributed and Decentralized Statistical Inference and Machine Learning: An Overview of Recent Advances Under the Byzantine Threat Model
    Yang, Zhixiong
    Gang, Arpita
    Bajwa, Waheed U.
    IEEE SIGNAL PROCESSING MAGAZINE, 2020, 37 (03) : 146 - 159
  • [33] Robust Distributed Learning Against Both Distributional Shifts and Byzantine Attacks
    Zhou, Guanqiang
    Xu, Ping
    Wang, Yue
    Tian, Zhi
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024,
  • [34] Resilient Mechanism Against Byzantine Failure for Distributed Deep Reinforcement Learning
    Zhang, Mingyue
    Jin, Zhi
    Hou, Jian
    Luo, Renwei
    2022 IEEE 33RD INTERNATIONAL SYMPOSIUM ON SOFTWARE RELIABILITY ENGINEERING (ISSRE 2022), 2022, : 378 - 389
  • [35] Byzantine-robust distributed sparse learning for M-estimation
    Jiyuan Tu
    Weidong Liu
    Xiaojun Mao
    Machine Learning, 2023, 112 : 3773 - 3804
  • [36] Distributed Statistical Min-Max Learning in the Presence of Byzantine Agents
    Adibi, Arman
    Mitra, Aritra
    Pappas, George J.
    Hassani, Hamed
    2022 IEEE 61ST CONFERENCE ON DECISION AND CONTROL (CDC), 2022, : 4179 - 4184
  • [37] Byzantine-robust distributed sparse learning for M-estimation
    Tu, Jiyuan
    Liu, Weidong
    Mao, Xiaojun
    MACHINE LEARNING, 2023, 112 (10) : 3773 - 3804
  • [38] A new system for distributed machine learning
    Jianyong Wang
    National Science Review, 2018, 5 (03) : 303 - 304
  • [39] Distributed machine learning and sparse representations
    Obst, Oliver
    NEUROCOMPUTING, 2014, 124 : 1 - 1
  • [40] Distributed Bayesian Machine Learning Procedures
    Biletskyy, B.
    CYBERNETICS AND SYSTEMS ANALYSIS, 2019, 55 (03) : 456 - 461