Genuinely distributed Byzantine machine learning

Cited by: 0

Authors:

El-Mahdi El-Mhamdi

Rachid Guerraoui

Arsany Guirguis

Lê-Nguyên Hoang

Sébastien Rouault

Affiliation:

[1] Ecole Polytechnique Fédérale de Lausanne (EPFL), School of Computer and Communication Sciences (IC)

Source:

Distributed Computing | 2022 / Vol. 35

Keywords:

Distributed machine learning; Robust machine learning; Byzantine fault tolerance; Byzantine parameter servers;

DOI:

Not available

Abstract:

Machine learning (ML) solutions are nowadays distributed, according to the so-called server/worker architecture. One server holds the model parameters while several workers train the model. Clearly, such architecture is prone to various types of component failures, which can be all encompassed within the spectrum of a Byzantine behavior. Several approaches have been proposed recently to tolerate Byzantine workers. Yet all require trusting a central parameter server. We initiate in this paper the study of the “general” Byzantine-resilient distributed machine learning problem where no individual component is trusted. In particular, we distribute the parameter server computation on several nodes. We show that this problem can be solved in an asynchronous system, despite the presence of $\frac{1}{3}$ Byzantine parameter servers (i.e., $n_{ps} > 3f_{ps}+1$) and $\frac{1}{3}$ Byzantine workers (i.e., $n_w > 3f_w$), which is asymptotically optimal.
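To make the two resilience conditions concrete, the following minimal sketch checks whether a deployment satisfies $n_{ps} > 3f_{ps}+1$ and $n_w > 3f_w$, and computes the largest fault counts those inequalities allow. The function names `satisfies_bounds` and `max_byzantine` are hypothetical helpers introduced here for illustration; they are not part of ByzSGD.

```python
from typing import Tuple


def satisfies_bounds(n_ps: int, f_ps: int, n_w: int, f_w: int) -> bool:
    """Return True if both inequalities stated in the abstract hold."""
    return n_ps > 3 * f_ps + 1 and n_w > 3 * f_w


def max_byzantine(n_ps: int, n_w: int) -> Tuple[int, int]:
    """Largest (f_ps, f_w) allowed by the inequalities (hypothetical helper).

    Assumes n_ps >= 2 and n_w >= 1, so that at least f = 0 is admissible.
    """
    if n_ps < 2 or n_w < 1:
        raise ValueError("need n_ps >= 2 and n_w >= 1")
    f_ps = (n_ps - 2) // 3  # largest integer f with 3f + 1 < n_ps
    f_w = (n_w - 1) // 3    # largest integer f with 3f < n_w
    return f_ps, f_w


if __name__ == "__main__":
    print(max_byzantine(5, 10))
    print(satisfies_bounds(4, 1, 10, 3))
```

For instance, under these inequalities 5 parameter servers tolerate 1 Byzantine server, whereas 4 parameter servers tolerate none, because $4 > 3 \cdot 1 + 1$ does not hold.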
We present a new algorithm, ByzSGD, which solves the general Byzantine-resilient distributed machine learning problem by relying on three major schemes. The first, scatter/gather, is a communication scheme whose goal is to bound the maximum drift among models on correct servers. The second, distributed median contraction (DMC), leverages the geometric properties of the median in high dimensional spaces to bring parameters within the correct servers back close to each other, ensuring safe and lively learning. The third, Minimum-diameter averaging (MDA), is a statistically-robust gradient aggregation rule whose goal is to tolerate Byzantine workers. MDA requires a loose bound on the variance of non-Byzantine gradient estimates, compared to existing alternatives [e.g., Krum (Blanchard et al., in: Neural information processing systems, pp 118-128, 2017)]. Interestingly, ByzSGD ensures Byzantine resilience without adding communication rounds (on a normal path), compared to vanilla non-Byzantine alternatives. ByzSGD requires, however, a larger number of messages which, we show, can be reduced if we assume synchrony. We implemented ByzSGD on top of both TensorFlow and PyTorch, and we report on our evaluation results. In particular, we show that ByzSGD guarantees convergence with around 32% overhead compared to vanilla SGD. Furthermore, we show that ByzSGD’s throughput overhead is 24–176% in the synchronous case and 28–220% in the asynchronous case.
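As an illustration of the minimum-diameter averaging idea mentioned above, here is a minimal sketch, not the authors' implementation. It assumes the usual reading of MDA: among all subsets of $n - f$ gradients, pick one whose diameter (largest pairwise Euclidean distance) is smallest, and return the average of that subset. The function name `mda`, its signature, and the brute-force subset search are assumptions made for this example.

```python
from itertools import combinations
from math import dist
from typing import List, Sequence


def mda(gradients: Sequence[Sequence[float]], f: int) -> List[float]:
    """Average the (n - f)-subset of gradients with the smallest diameter.

    Illustrative brute-force sketch: it enumerates every subset, so the
    cost grows combinatorially with the number of gradients.
    """
    n = len(gradients)
    if not 0 <= f < n:
        raise ValueError("need 0 <= f < number of gradients")

    def diameter(subset: Sequence[int]) -> float:
        # Largest pairwise Euclidean distance inside the subset
        # (0.0 when the subset holds a single gradient).
        return max(
            (dist(gradients[i], gradients[j]) for i, j in combinations(subset, 2)),
            default=0.0,
        )

    best = min(combinations(range(n), n - f), key=diameter)
    dim = len(gradients[0])
    return [sum(gradients[i][k] for i in best) / len(best) for k in range(dim)]


if __name__ == "__main__":
    honest = [[1.0, 0.0], [1.1, 0.0], [0.9, 0.0]]
    byzantine = [[100.0, 100.0]]
    print(mda(honest + byzantine, f=1))
```

In this toy input, any subset containing the outlier has a large diameter, so the outlier is excluded and only the three nearby gradients are averaged. The exhaustive search is practical only for small numbers of workers.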

Pages: 305-331

Page count: 26
