Comparative Study of Apache Spark MLlib Clustering Algorithms

Cited by: 6

Authors:

Harifi, Sasan [1]

Byagowi, Ebrahim [1]

Khalilian, Madjid [1]

Affiliations:

[1] Islamic Azad Univ, Karaj Branch, Dept Comp Engn, Karaj, Iran

Source:

DATA MINING AND BIG DATA, DMBD 2017 | 2017 / Vol. 10387

Keywords:

Clustering; k-means; Bisecting k-means; Spark MLlib; Big data; KDD cup 99; Cover type; Train time; Cohesion;

DOI:

10.1007/978-3-319-61845-6_7

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Clustering of big data has received much attention recently. Analytics algorithms on big datasets require tremendous computational capabilities. Apache Spark is a popular open-source platform for large-scale data processing that is well suited to iterative machine learning tasks. This paper presents an overview of the algorithms in the Apache Spark Machine Learning Library (Spark.MLlib). The clustering methods, namely the Gaussian Mixture Model (GMM), Power-Iteration Clustering, Latent Dirichlet Allocation (LDA), and k-means, are described in full. Three benchmark datasets, Forest Cover Type, KDD Cup 99, and Internet Advertisements, are used for the experiments. Algorithms that can be compared with one another are compared. For a better understanding of the experimental results, the algorithms are described with suitable tables and graphs.


Pages: 61-73

Page count: 13

