Multiple Parallel MapReduce k-means Clustering with Validation and Selection

Cited by: 6

Authors:

Garcia, Kemilly Dearo [1]

Naldi, Murilo Coelho [1]

Affiliations:

[1] UFV, Dept Exact & Technol Sci, Rio Paranaiba, Brazil

Source:

2014 BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS) | 2014

Keywords:

distributed clustering; k-means; MapReduce

DOI:

10.1109/BRACIS.2014.83

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

Handling large volumes of data is one of the challenges for clustering, creating the need to distribute and manage huge data sets across separate repositories. New distributed systems have been designed to scale from a single server to thousands of machines. The MapReduce framework allows a job to be divided and its results combined seamlessly. k-means is one of the few clustering algorithms that satisfy the MapReduce constraints, but it requires the number of clusters to be specified in advance and is sensitive to initialization. In this work, we propose a MapReduce clustering algorithm that executes multiple parallel runs of k-means with different initializations and numbers of clusters. Additionally, a MapReduce version of a relative cluster validity index is implemented and used to select the best result. The proposed algorithm is experimentally compared with the Apache Mahout project's MapReduce implementation of k-means. Statistical tests applied to the results indicate that the proposed algorithm can outperform Mahout's implementation when multiple k-means partitions are required.
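The idea in the abstract (many k-means runs over different initializations and numbers of clusters, followed by selection with a relative validity index) can be illustrated with a minimal single-process sketch. This is not the authors' MapReduce implementation: the runs below execute sequentially, the function names (`kmeans`, `simplified_silhouette`, `select_best`) are hypothetical, and the simplified silhouette is an assumed choice of index, since the abstract does not name the one used.

```python
import random
from math import dist  # available in Python 3.8+


def kmeans(points, k, seed, iters=20):
    """Plain Lloyd's k-means with random initial centroids; returns centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # a different seed gives a different start
    for _ in range(iters):
        # Assignment step (what a mapper would do per point).
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step (what a reducer would do per cluster).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if the cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids


def simplified_silhouette(points, centroids):
    """Assumed validity index: mean of (b - a) / max(a, b) per point, where a is
    the distance to the nearest centroid and b to the second nearest.
    Requires at least two centroids."""
    total = 0.0
    for p in points:
        a, b = sorted(dist(p, c) for c in centroids)[:2]
        total += (b - a) / b if b > 0 else 0.0  # b >= a, so max(a, b) == b
    return total / len(points)


def select_best(points, ks, seeds):
    """Run k-means for every (k, seed) pair and keep the highest-scoring result."""
    best = None
    for k in ks:
        for seed in seeds:
            centroids = kmeans(points, k, seed)
            score = simplified_silhouette(points, centroids)
            if best is None or score > best[0]:
                best = (score, k, centroids)
    return best


if __name__ == "__main__":
    # Toy data: three small, well-separated groups of 2-D points.
    data = [(cx + dx, cy + dy)
            for cx, cy in [(0, 0), (10, 0), (0, 10)]
            for dx in (-0.4, 0.0, 0.4)
            for dy in (-0.4, 0.0, 0.4)]
    score, k, _ = select_best(data, ks=range(2, 5), seeds=range(30))
    print(f"selected k={k} with index value {score:.3f}")
```

Because each (k, seed) run is independent of the others, the double loop in `select_best` is the part a distributed version could execute in parallel; the centroid-based index is attractive in that setting because it needs only one pass over the points per candidate partition.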


Pages: 432-437

Number of pages: 6
