Multiple Parallel MapReduce k-means Clustering with Validation and Selection

Cited by: 6
Authors
Garcia, Kemilly Dearo [1]
Naldi, Murilo Coelho [1]
Affiliations
[1] UFV, Dept Exact & Technol Sci, Rio Paranaiba, Brazil
Keywords
distributed clustering; k-means; MapReduce
DOI
10.1109/BRACIS.2014.83
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Dealing with large amounts of data is one of the challenges of clustering, and it creates the need to distribute and manage huge data sets across separate repositories. New distributed systems have been designed to scale from a single server to thousands of machines. The MapReduce framework allows a job to be divided and its results to be combined seamlessly. k-means is one of the few clustering algorithms that satisfy the MapReduce constraints, but it requires the number of clusters to be specified in advance and is sensitive to how the clusters are initialized. In this work, we propose a MapReduce clustering algorithm that executes multiple parallel runs of k-means with different initializations and numbers of clusters. Additionally, a MapReduce version of a relative cluster validity index is implemented and used to select the best result. The proposed algorithm is experimentally compared with the Apache Mahout project's MapReduce implementation of k-means. Statistical tests applied to the results indicate that the proposed algorithm can outperform Mahout's implementation when multiple k-means partitions are required.
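The idea described in the abstract can be illustrated with a minimal, single-process Python sketch. It is not the authors' implementation and does not use Hadoop or Mahout: the map and reduce steps are ordinary functions, every name (`map_phase`, `reduce_phase`, `simplified_silhouette`, `multi_kmeans`) is hypothetical, and the simplified silhouette is assumed here as the relative validity index because the abstract does not name the one used. Each k-means run is identified by a key `(k, init_id)`, so a single pass over the data updates all runs at once.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def map_phase(points, runs):
    """Map: for every point and every run, emit ((run_id, cluster_idx), (point, 1))."""
    for p in points:
        for run_id, centroids in runs.items():
            j = min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            yield (run_id, j), (p, 1)


def reduce_phase(pairs, runs):
    """Reduce: sum the points of each (run, cluster) key and average them."""
    sums = {}
    for key, (p, n) in pairs:
        s, c = sums.get(key, ((0.0,) * len(p), 0))
        sums[key] = (tuple(a + b for a, b in zip(s, p)), c + n)
    # Clusters that received no points keep their previous centroid.
    new_runs = {run_id: list(cs) for run_id, cs in runs.items()}
    for (run_id, j), (s, c) in sums.items():
        new_runs[run_id][j] = tuple(a / c for a in s)
    return new_runs


def simplified_silhouette(points, centroids):
    """Mean of (b - a) / max(a, b), where a and b are the distances from a
    point to its nearest and second-nearest centroid (assumed index)."""
    total = 0.0
    for p in points:
        a, b = sorted(dist(p, c) for c in centroids)[:2]
        total += (b - a) / b if b > 0 else 0.0
    return total / len(points)


def multi_kmeans(points, ks, inits_per_k=3, iterations=10, seed=0):
    """Run several k-means instances together and return the best-scoring one."""
    rng = random.Random(seed)
    # One run per (k, initialization) pair; initial centroids are sampled points.
    runs = {(k, i): rng.sample(points, k) for k in ks for i in range(inits_per_k)}
    for _ in range(iterations):
        runs = reduce_phase(map_phase(points, runs), runs)
    scores = {run_id: simplified_silhouette(points, cs) for run_id, cs in runs.items()}
    best = max(scores, key=scores.get)
    return best, runs[best], scores[best]


if __name__ == "__main__":
    rng = random.Random(1)
    centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    data = [(cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
            for cx, cy in centers for _ in range(30)]
    best_run, best_centroids, best_score = multi_kmeans(data, ks=[2, 3, 4])
    print(best_run, round(best_score, 3))
```

Keying every emitted pair by run identifier is what lets one scan of the data serve all runs, which is the case in which the abstract reports an advantage over launching separate k-means jobs. The number of clusters must be at least 2 for the validity index in this sketch to be defined.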
Pages: 432-437
Page count: 6