Model averaging in distributed machine learning: a case study with Apache Spark

Cited by: 6

Authors:

Guo, Yunyan [1]

Zhang, Zhipeng [2]

Jiang, Jiawei [4]

Wu, Wentao [3]

Zhang, Ce [4]

Cui, Bin [2]

Li, Jianzhong [1]

Affiliations:

[1] Harbin Inst Technol, Mass Data Comp Res Ctr, Harbin 150001, Peoples R China

[2] Peking Univ, Sch EECS, Beijing 100871, Peoples R China

[3] Microsoft Res, Redmond, WA USA

[4] Swiss Fed Inst Technol, Dept Comp Sci, CH-8092 Zurich, Switzerland

Source:

VLDB JOURNAL | 2021 / Volume 30 / Issue 04

Funding:

National Natural Science Foundation of China;

Keywords:

Distributed machine learning; Apache Spark MLlib; Generalized linear models; Latent Dirichlet allocation;

DOI:

10.1007/s00778-021-00664-7

Chinese Library Classification number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

The increasing popularity of Apache Spark has attracted many users to put their data into its ecosystem. On the other hand, it has been witnessed in the literature that Spark is slow when it comes to distributed machine learning (ML). One resort is to switch to specialized systems such as parameter servers, which are claimed to have better performance. Nonetheless, users have to undergo the painful procedure of moving data into and out of Spark. In this paper, we investigate performance bottlenecks of MLlib (an official Spark package for ML) in detail, by focusing on analyzing its implementation of stochastic gradient descent (SGD), the workhorse behind the training of many ML models. We show that the performance inferiority of Spark is caused by implementation issues rather than fundamental flaws of the bulk synchronous parallel (BSP) model that governs Spark's execution: we can significantly improve Spark's performance by leveraging the well-known "model averaging" (MA) technique in distributed ML. Indeed, model averaging is not limited to SGD, and we further showcase an application of MA to training latent Dirichlet allocation (LDA) models within Spark. Our implementation is not intrusive and requires light development effort. Experimental evaluation results reveal that the MA-based versions of SGD and LDA can be orders of magnitude faster compared to their counterparts without using MA.
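
The model averaging idea named in the abstract can be illustrated with a small single-process sketch. It is not the paper's Spark implementation: the toy one-parameter least-squares problem, the synthetic partitions, and the function names `local_sgd`, `model_averaging_sgd`, and `make_partitions` are all assumptions made here for illustration. In each round, every simulated worker runs SGD over its own partition starting from the shared model, and the resulting local models are then averaged.

```python
import random


def local_sgd(w, partition, lr):
    """One pass of plain SGD for the loss (w*x - y)^2 over one worker's data."""
    for x, y in partition:
        grad = 2.0 * (w * x - y) * x
        w -= lr * grad
    return w


def model_averaging_sgd(partitions, rounds=20, lr=0.05, w0=0.0):
    """Hypothetical sketch: per round, workers refine the shared model
    locally, then the local models are averaged into the new shared model."""
    w = w0
    for _ in range(rounds):
        # Each simulated worker starts from the same shared model.
        local_models = [local_sgd(w, part, lr) for part in partitions]
        # One synchronization step per round: average the models.
        w = sum(local_models) / len(local_models)
    return w


def make_partitions(num_workers=4, points_per_worker=50, true_w=3.0, seed=0):
    """Synthetic, noise-free data y = true_w * x, split across workers."""
    rng = random.Random(seed)
    partitions = []
    for _ in range(num_workers):
        part = []
        for _ in range(points_per_worker):
            x = rng.uniform(-1.0, 1.0)
            part.append((x, true_w * x))
        partitions.append(part)
    return partitions


partitions = make_partitions()
w_final = model_averaging_sgd(partitions)
print(w_final)
```

In this sketch the workers exchange one model per round instead of one gradient per mini-batch, which is the general trade-off that model averaging exploits; how the paper realizes this inside Spark is not shown here.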


Pages: 693-712

Number of pages: 20

50 items in total

[1] Model averaging in distributed machine learning: a case study with Apache Spark
Yunyan Guo
Zhipeng Zhang
Jiawei Jiang
Wentao Wu
Ce Zhang
Bin Cui
Jianzhong Li
[J]. The VLDB Journal, 2021, 30 : 693 - 712
[2] On Scalability of Distributed Machine Learning with Big Data on Apache Spark
Hai, Ameen Abdel
Forouraghi, Babak
[J]. BIG DATA - BIGDATA 2018, 2018, 10968 : 209 - 219
[3] Predicting Diabetes using Distributed Machine Learning based on Apache Spark
Ahmed, Hager
Younis, Eman M. G.
Ali, Abdelmgeid A.
[J]. PROCEEDINGS OF 2020 INTERNATIONAL CONFERENCE ON INNOVATIVE TRENDS IN COMMUNICATION AND COMPUTER ENGINEERING (ITCE), 2020, : 44 - 49
[4] Understanding and Optimizing the Performance of Distributed Machine Learning Applications on Apache Spark
Dunner, Celestine
Parnell, Thomas
Atasu, Kubilay
Sifalakis, Manolis
Pozidis, Haralampos
[J]. 2017 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2017, : 331 - 338
[5] MLlib: Machine learning in Apache Spark
Meng, Xiangrui
Bradley, Joseph
Yavuz, Burak
Sparks, Evan
Venkataraman, Shivaram
Liu, Davies
Freeman, Jeremy
Tsai, D.B.
Amde, Manish
Owen, Sean
Xin, Doris
Xin, Reynold
Franklin, Michael J.
Zadeh, Reza
Zaharia, Matei
Talwalkar, Ameet
[J]. Journal of Machine Learning Research, 2016, 17
[6] MLlib: Machine Learning in Apache Spark
Meng, Xiangrui
Bradley, Joseph
Yavuz, Burak
Sparks, Evan
Venkataraman, Shivaram
Liu, Davies
Freeman, Jeremy
Tsai, D. B.
Amde, Manish
Owen, Sean
Xin, Doris
Xin, Reynold
Franklin, Michael J.
Zadeh, Reza
Zaharia, Matei
Talwalkar, Ameet
[J]. JOURNAL OF MACHINE LEARNING RESEARCH, 2016, 17
[7] Characterizing Distributed Machine Learning Workloads on Apache Spark (Experimentation and Deployment Paper)
Djebrouni, Yasmine
Rocha, Isabelly
Bouchenak, Sara
Chen, Lydia
Felber, Pascal
Marangozova, Vania
Schiavoni, Valerio
[J]. PROCEEDINGS OF THE 24TH ACM/IFIP INTERNATIONAL MIDDLEWARE CONFERENCE, MIDDLEWARE 2023, 2023, : 151 - 164
[8] Towards Distributed Model Analytics with Apache Spark
Babur, Onder
Cleophas, Loek
van den Brand, Mark
[J]. PROCEEDINGS OF THE 6TH INTERNATIONAL CONFERENCE ON MODEL-DRIVEN ENGINEERING AND SOFTWARE DEVELOPMENT, 2018, : 767 - 772
[9] Privacy-Preserving Machine Learning on Apache Spark
Brito, Claudia V.
Ferreira, Pedro G.
Portela, Bernardo L.
Oliveira, Rui C.
Paulo, Joao T.
[J]. IEEE ACCESS, 2023, 11 : 127907 - 127930
[10] Optimizing Machine Learning on Apache Spark in HPC Environments
Li, Zhenyu
Davis, James
Jarvis, Stephen A.
[J]. PROCEEDINGS OF 2018 IEEE/ACM MACHINE LEARNING IN HPC ENVIRONMENTS (MLHPC 2018), 2018, : 95 - 105
