Model averaging in distributed machine learning: a case study with Apache Spark

Cited by: 6
Authors
Guo, Yunyan [1]
Zhang, Zhipeng [2]
Jiang, Jiawei [4]
Wu, Wentao [3]
Zhang, Ce [4]
Cui, Bin [2]
Li, Jianzhong [1]
Affiliations
[1] Harbin Inst Technol, Mass Data Comp Res Ctr, Harbin 150001, Peoples R China
[2] Peking Univ, Sch EECS, Beijing 100871, Peoples R China
[3] Microsoft Res, Redmond, WA USA
[4] Swiss Fed Inst Technol, Dept Comp Sci, CH-8092 Zurich, Switzerland
Source
VLDB JOURNAL | 2021 / Volume 30 / Issue 04
Funding
National Natural Science Foundation of China
Keywords
Distributed machine learning; Apache Spark MLlib; Generalized linear models; Latent Dirichlet allocation
DOI
10.1007/s00778-021-00664-7
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The increasing popularity of Apache Spark has attracted many users to put their data into its ecosystem. On the other hand, it has been witnessed in the literature that Spark is slow when it comes to distributed machine learning (ML). One resort is to switch to specialized systems such as parameter servers, which are claimed to have better performance. Nonetheless, users have to undergo the painful procedure of moving data into and out of Spark. In this paper, we investigate performance bottlenecks of MLlib (an official Spark package for ML) in detail, by focusing on analyzing its implementation of stochastic gradient descent (SGD), the workhorse under the training of many ML models. We show that the performance inferiority of Spark is caused by implementation issues rather than fundamental flaws of the bulk synchronous parallel (BSP) model that governs Spark's execution: we can significantly improve Spark's performance by leveraging the well-known "model averaging" (MA) technique in distributed ML. Indeed, model averaging is not limited to SGD, and we further showcase an application of MA to training latent Dirichlet allocation (LDA) models within Spark. Our implementation is not intrusive and requires light development effort. Experimental evaluation results reveal that the MA-based versions of SGD and LDA can be orders of magnitude faster compared to their counterparts without using MA.
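As a rough illustration of the "model averaging" idea the abstract refers to, the sketch below runs SGD independently on each data partition and then averages the resulting models once per round. It is a minimal single-process simulation in plain Python, not the paper's Spark implementation. The least-squares objective, the synthetic data, and all names (`local_sgd`, `average_models`, `train_ma`, `make_partitions`) are assumptions made for illustration only.

```python
import random


def local_sgd(w, partition, lr):
    """One pass of least-squares SGD over a single partition, starting from w."""
    w = list(w)  # copy so every simulated worker starts from the same global model
    for x, y in partition:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


def average_models(models):
    """Coordinate-wise mean of the workers' local models."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]


def train_ma(partitions, dim, lr=0.05, rounds=50):
    """Each round: every partition refines the global model locally, then the
    local models are averaged (one synchronization per round)."""
    w = [0.0] * dim
    for _ in range(rounds):
        local_models = [local_sgd(w, p, lr) for p in partitions]
        w = average_models(local_models)
    return w


def make_partitions(true_w, num_partitions=4, rows_per_partition=50, seed=0):
    """Hypothetical noiseless synthetic data, split into partitions."""
    rng = random.Random(seed)
    partitions = []
    for _ in range(num_partitions):
        rows = []
        for _ in range(rows_per_partition):
            x = [rng.uniform(-1.0, 1.0) for _ in true_w]
            y = sum(wi * xi for wi, xi in zip(true_w, x))
            rows.append((x, y))
        partitions.append(rows)
    return partitions


TRUE_W = [2.0, -3.0]
model = train_ma(make_partitions(TRUE_W), dim=len(TRUE_W))
```

In this sketch each simulated worker makes many local updates between synchronizations, so only one model per worker is exchanged per round. How the authors realize this inside Spark is described in the paper itself and is not reproduced here.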
Pages: 693-712
Number of pages: 20
Related papers
50 in total
  • [1] Model averaging in distributed machine learning: a case study with Apache Spark
    Yunyan Guo
    Zhipeng Zhang
    Jiawei Jiang
    Wentao Wu
    Ce Zhang
    Bin Cui
    Jianzhong Li
    [J]. The VLDB Journal, 2021, 30 : 693 - 712
  • [2] On Scalability of Distributed Machine Learning with Big Data on Apache Spark
    Hai, Ameen Abdel
    Forouraghi, Babak
    [J]. BIG DATA - BIGDATA 2018, 2018, 10968 : 209 - 219
  • [3] Predicting Diabetes using Distributed Machine Learning based on Apache Spark
    Ahmed, Hager
    Younis, Eman M. G.
    Ali, Abdelmgeid A.
    [J]. PROCEEDINGS OF 2020 INTERNATIONAL CONFERENCE ON INNOVATIVE TRENDS IN COMMUNICATION AND COMPUTER ENGINEERING (ITCE), 2020, : 44 - 49
  • [4] Understanding and Optimizing the Performance of Distributed Machine Learning Applications on Apache Spark
    Dunner, Celestine
    Parnell, Thomas
    Atasu, Kubilay
    Sifalakis, Manolis
    Pozidis, Haralampos
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2017, : 331 - 338
  • [5] MLlib: Machine learning in Apache Spark
    Meng, Xiangrui
    Bradley, Joseph
    Yavuz, Burak
    Sparks, Evan
    Venkataraman, Shivaram
    Liu, Davies
    Freeman, Jeremy
    Tsai, D.B.
    Amde, Manish
    Owen, Sean
    Xin, Doris
    Xin, Reynold
    Franklin, Michael J.
    Zadeh, Reza
    Zaharia, Matei
    Talwalkar, Ameet
    [J]. Journal of Machine Learning Research, 2016, 17
  • [6] MLlib: Machine Learning in Apache Spark
    Meng, Xiangrui
    Bradley, Joseph
    Yavuz, Burak
    Sparks, Evan
    Venkataraman, Shivaram
    Liu, Davies
    Freeman, Jeremy
    Tsai, D. B.
    Amde, Manish
    Owen, Sean
    Xin, Doris
    Xin, Reynold
    Franklin, Michael J.
    Zadeh, Reza
    Zaharia, Matei
    Talwalkar, Ameet
    [J]. JOURNAL OF MACHINE LEARNING RESEARCH, 2016, 17
  • [7] Characterizing Distributed Machine Learning Workloads on Apache Spark (Experimentation and Deployment Paper)
    Djebrouni, Yasmine
    Rocha, Isabelly
    Bouchenak, Sara
    Chen, Lydia
    Felber, Pascal
    Marangozova, Vania
    Schiavoni, Valerio
    [J]. PROCEEDINGS OF THE 24TH ACM/IFIP INTERNATIONAL MIDDLEWARE CONFERENCE, MIDDLEWARE 2023, 2023, : 151 - 164
  • [8] Towards Distributed Model Analytics with Apache Spark
    Babur, Onder
    Cleophas, Loek
    van den Brand, Mark
    [J]. PROCEEDINGS OF THE 6TH INTERNATIONAL CONFERENCE ON MODEL-DRIVEN ENGINEERING AND SOFTWARE DEVELOPMENT, 2018, : 767 - 772
  • [9] Privacy-Preserving Machine Learning on Apache Spark
    Brito, Claudia V.
    Ferreira, Pedro G.
    Portela, Bernardo L.
    Oliveira, Rui C.
    Paulo, Joao T.
    [J]. IEEE ACCESS, 2023, 11 : 127907 - 127930
  • [10] Optimizing Machine Learning on Apache Spark in HPC Environments
    Li, Zhenyu
    Davis, James
    Jarvis, Stephen A.
    [J]. PROCEEDINGS OF 2018 IEEE/ACM MACHINE LEARNING IN HPC ENVIRONMENTS (MLHPC 2018), 2018, : 95 - 105