Comparative Study of Apache Spark MLlib Clustering Algorithms

Cited by: 6
Authors
Harifi, Sasan [1 ]
Byagowi, Ebrahim [1 ]
Khalilian, Madjid [1 ]
Affiliation
[1] Islamic Azad Univ, Karaj Branch, Dept Comp Engn, Karaj, Iran
Source
Keywords
Clustering; k-means; Bisecting k-means; Spark MLlib; Big data; KDD Cup 99; Cover type; Train time; Cohesion
DOI
10.1007/978-3-319-61845-6_7
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Clustering of big data has received much attention recently. Analytics algorithms on big datasets require tremendous computational capabilities. Apache Spark is a popular open-source platform for large-scale data processing that is well suited to iterative machine learning tasks. This paper presents an overview of the algorithms in the Apache Spark Machine Learning Library (Spark.MLlib). The clustering methods, namely Gaussian Mixture Model (GMM), Power-Iteration Clustering, Latent Dirichlet Allocation (LDA), and k-means, are described in detail. Three benchmark datasets, Forest Cover Type, KDD Cup 99, and Internet Advertisements, are used for the experiments. Algorithms that are comparable with one another are compared. To make the experimental results easier to understand, they are presented with tables and graphs.
Pages: 61-73
Number of pages: 13