Comparative Study of Apache Spark MLlib Clustering Algorithms

被引:6
|
作者
Harifi, Sasan [1 ]
Byagowi, Ebrahim [1 ]
Khalilian, Madjid [1 ]
机构
[1] Islamic Azad Univ, Karaj Branch, Dept Comp Engn, Karaj, Iran
来源
关键词
Clustering; k-means; Bisecting k-means; Spark MLlib; Big data; KDD cup 99; Cover type; Train time; Cohesion;
D O I
10.1007/978-3-319-61845-6_7
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Clustering of big data has received much attention recently. Analytics algorithms on big datasets require tremendous computational capabilities. Apache Spark is a popular open- source platform for large-scale data processing that is well-suited for iterative machine learning tasks. This paper presents an overview of Apache Spark Machine Learning Library (Spark.MLlib) algorithms. The clustering methods consist of Gaussian Mixture Model (GMM), Power-Iteration Clustering method, Latent Dirichlet Allocation (LDA), and k-means are completely described. In this paper, three benchmark datasets include Forest Cover Type, KDD Cup 99 and Internet Advertisements used for experiments. The same algorithms that can be compared with each other, compared. For a better understanding of the results of the experiments, the algorithms are described with suitable tables and graphs.
引用
收藏
页码:61 / 73
页数:13
相关论文
共 50 条
  • [1] MLlib: Machine learning in Apache Spark
    Meng, Xiangrui
    Bradley, Joseph
    Yavuz, Burak
    Sparks, Evan
    Venkataraman, Shivaram
    Liu, Davies
    Freeman, Jeremy
    Tsai, D.B.
    Amde, Manish
    Owen, Sean
    Xin, Doris
    Xin, Reynold
    Franklin, Michael J.
    Zadeh, Reza
    Zaharia, Matei
    Talwalkar, Ameet
    [J]. Journal of Machine Learning Research, 2016, 17
  • [2] MLlib: Machine Learning in Apache Spark
    Meng, Xiangrui
    Bradley, Joseph
    Yavuz, Burak
    Sparks, Evan
    Venkataraman, Shivaram
    Liu, Davies
    Freeman, Jeremy
    Tsai, D. B.
    Amde, Manish
    Owen, Sean
    Xin, Doris
    Xin, Reynold
    Franklin, Michael J.
    Zadeh, Reza
    Zaharia, Matei
    Talwalkar, Ameet
    [J]. JOURNAL OF MACHINE LEARNING RESEARCH, 2016, 17
  • [3] Predicting Potential Banking Customer Churn using Apache Spark ML and MLlib Packages: A Comparative Study
    Sayed, Hend
    Abdel-Fattah, Manal A.
    Kholief, Sherif
    [J]. INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2018, 9 (11) : 674 - 677
  • [4] Big Data Machine Learning using Apache Spark MLlib
    Assefi, Mehdi
    Behravesh, Ehsun
    Liu, Guangchi
    Tafti, Ahmad P.
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2017, : 3492 - 3498
  • [5] Analyzing Performance of Apache Spark MLlib with Multinode Clusters on Azure HDInsight: Spark-Perf Case Study
    Minukhin, Sergii
    Brynza, Natalia
    Sitnikov, Dmytro
    [J]. LECTURE NOTES IN COMPUTATIONAL INTELLIGENCE AND DECISION MAKING (ISDMCI 2020), 2020, 1246 : 114 - 134
  • [6] Fuzzy Based Clustering Algorithms to Handle Big Data with Implementation on Apache Spark
    Bharill, Neha
    Tiwari, Aruna
    Malviya, Aayushi
    [J]. PROCEEDINGS 2016 IEEE SECOND INTERNATIONAL CONFERENCE ON BIG DATA COMPUTING SERVICE AND APPLICATIONS (BIGDATASERVICE 2016), 2016, : 95 - 104
  • [7] MLlib*: Fast Training of GLMs using Spark MLlib
    Zhang, Zhipeng
    Jiang, Jiawei
    Wu, Wentao
    Zhang, Ce
    Yu, Lele
    Cui, Bin
    [J]. 2019 IEEE 35TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2019), 2019, : 1778 - 1789
  • [8] Performance evaluation of DNN with other machine learning techniques in a cluster using Apache Spark and MLlib
    JayaLakshmi, A. N. M.
    Kishore, K. V. Krishna
    [J]. JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2022, 34 (01) : 1311 - 1319
  • [9] An Apache Spark Implementation for Text Document Clustering
    Dritsas, Elias
    Trigka, Maria
    Vonitsanos, Gerasimos
    Kanavos, Andreas
    Mylonas, Phivos
    [J]. 2022 17TH INTERNATIONAL WORKSHOP ON SEMANTIC AND SOCIAL MEDIA ADAPTATION & PERSONALIZATION (SMAP 2022), 2022, : 50 - 55
  • [10] Scalable Implementation of Dependence Clustering in Apache Spark
    Ivannikova, Elena
    [J]. PROCEEDINGS OF THE 2017 EVOLVING AND ADAPTIVE INTELLIGENT SYSTEMS (EAIS), 2017,