Multiple Parallel MapReduce k-means Clustering with Validation and Selection

Cited by: 6
Authors
Garcia, Kemilly Dearo [1]
Naldi, Murilo Coelho [1]
Affiliations
[1] UFV, Dept Exact & Technol Sci, Rio Paranaiba, Brazil
Keywords
distributed clustering; k-means; MapReduce
DOI
10.1109/BRACIS.2014.83
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Handling large amounts of data is one of the main challenges for clustering, and it creates the need to distribute and manage huge data sets across separate repositories. New distributed systems have been designed to scale from a single server to thousands of machines. The MapReduce framework allows a job to be divided and its results to be combined seamlessly. k-means is one of the few clustering algorithms that satisfy the MapReduce constraints, but it requires the number of clusters to be specified in advance and is sensitive to how the clusters are initialized. In this work, we propose a MapReduce clustering algorithm that executes multiple parallel runs of k-means with different initializations and numbers of clusters. Additionally, a MapReduce version of a relative cluster validity index is implemented and used to find the best result. The proposed algorithm is compared experimentally with the Apache Mahout project's MapReduce implementation of k-means. Statistical tests applied to the results indicate that the proposed algorithm can outperform Mahout's implementation when multiple k-means partitions are required.
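The idea in the abstract (several k-means runs advanced together in one map/reduce pass, then ranked by a relative validity index) can be illustrated with a small single-process sketch. This is not the authors' implementation and does not use Hadoop or Mahout. All function names are hypothetical, and the simplified silhouette is an assumed choice of index, since the abstract does not name the one used.

```python
import random
from collections import defaultdict

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))

def map_assign(point, runs):
    """Map step: for every run, emit ((run_id, cluster_id), (point, 1))."""
    for run_id, centroids in runs.items():
        yield (run_id, nearest(point, centroids)), (point, 1)

def reduce_centroids(pairs, dim):
    """Reduce step: sum the points per key, then divide by the count."""
    sums = defaultdict(lambda: [[0.0] * dim, 0])
    for key, (point, count) in pairs:
        acc = sums[key]
        acc[0] = [s + x for s, x in zip(acc[0], point)]
        acc[1] += count
    return {key: [s / n for s in total] for key, (total, n) in sums.items()}

def kmeans_multi(data, ks, inits_per_k, iters=20, seed=0):
    """Advance all (k, init) runs together, one map/reduce pass per iteration."""
    rng = random.Random(seed)
    dim = len(data[0])
    # One run per (number of clusters, initialization index).
    runs = {(k, i): rng.sample(data, k) for k in ks for i in range(inits_per_k)}
    for _ in range(iters):
        pairs = [kv for p in data for kv in map_assign(p, runs)]
        new = reduce_centroids(pairs, dim)
        # An empty cluster keeps its previous centroid.
        runs = {run_id: [new.get((run_id, j), c) for j, c in enumerate(cents)]
                for run_id, cents in runs.items()}
    return runs

def simplified_silhouette(data, centroids):
    """Assumed index: mean of (b - a) / b using point-to-centroid distances."""
    total = 0.0
    for p in data:
        d = sorted(sq_dist(p, c) ** 0.5 for c in centroids)
        a, b = d[0], d[1]  # own centroid and nearest other centroid
        total += (b - a) / b if b > 0 else 0.0
    return total / len(data)

def select_best(data, runs):
    """Return the (run_id, centroids) pair with the highest index value."""
    return max(runs.items(), key=lambda kv: simplified_silhouette(data, kv[1]))
```

Keying every emitted pair by both run and cluster is what lets a single pass over the data serve all runs at once; the index needs at least two centroids, so this sketch assumes every k is 2 or larger.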
Pages: 432-437
Page count: 6