Efficient Parallel K-Means on MapReduce Using Triangle Inequality

Cited by: 2
Authors
Al Ghamdi, Sami [1]
Di Fatta, Giuseppe [1]
Affiliation
[1] Univ Reading, Dept Comp Sci, Reading RG6 6AY, Berks, England
DOI
10.1109/DASC-PICom-DataCom-CyberSciTec.2017.163
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
K-Means is one of the most efficient and popular clustering algorithms and has been in use for more than 50 years. The naive implementation of K-Means spends the vast majority of its time on redundant distance calculations from each point to all cluster centres. This issue has been extensively studied, and methods based on the triangle inequality have been used to eliminate unnecessary distance calculations. Most triangle inequality optimisations cache extra information (distance bounds and cluster assignments) from one iteration to eliminate the need to compute exact distances in the next. This work takes these optimisations one step further and integrates them into an accelerated version of K-Means on MapReduce, a well-known distributed computing framework, to produce an efficient and highly scalable K-Means for big data. Although MapReduce is considered one of the most reliable and fault-tolerant distributed computing frameworks, one of its major drawbacks is that it does not support iterative algorithms such as K-Means and does not cache any data between two consecutive iterations, which most triangle inequality optimisations require. Therefore, this work introduces two new approaches to pass information from one iteration to the next to accelerate K-Means. The first approach is called K-Means on MapReduce using Extended Vector (KMMR-EV). The second approach is called K-Means on MapReduce using Bounds Files (KMMR-BF). These approaches achieve speedups of up to 4.5x for KMMR-EV and 6.8x for KMMR-BF with respect to the naive implementation of K-Means on MapReduce (KMMR-N). Extensive experiments with real and synthetic datasets have been conducted on Apache Hadoop (an open-source implementation of MapReduce), along with an overhead analysis, to show the effectiveness of both approaches.
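To make the triangle inequality idea in the abstract concrete, the following is a minimal single-machine sketch of the assignment step. It is not the paper's KMMR-EV or KMMR-BF method and does not carry bounds across iterations; it only illustrates the standard pruning test that if d(c_a, c_j) >= 2 d(x, c_a), then centre c_j cannot be closer to x than the current best centre c_a, so d(x, c_j) need not be computed. The function names `dist`, `assign_with_pruning`, and `assign_naive` are hypothetical and chosen for illustration.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_naive(points, centres):
    """Baseline: compute every point-to-centre distance."""
    return [min(range(len(centres)), key=lambda j: dist(p, centres[j]))
            for p in points]


def assign_with_pruning(points, centres):
    """Assign points to nearest centres, skipping distances that the
    triangle inequality proves cannot improve the current best.

    Returns (labels, number_of_point_to_centre_distances_computed).
    """
    k = len(centres)
    # Centre-to-centre distances are computed once per assignment pass.
    cc = [[dist(centres[i], centres[j]) for j in range(k)] for i in range(k)]
    labels = []
    computed = 0
    for p in points:
        best = 0
        best_d = dist(p, centres[0])
        computed += 1
        for j in range(1, k):
            # If d(c_best, c_j) >= 2 * d(p, c_best), then
            # d(p, c_j) >= d(c_best, c_j) - d(p, c_best) >= d(p, c_best),
            # so centre j cannot win and its distance is skipped.
            if cc[best][j] >= 2 * best_d:
                continue
            d = dist(p, centres[j])
            computed += 1
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, computed


if __name__ == "__main__":
    points = [(0, 0), (0.1, 0), (10, 10), (10.2, 9.9), (20, 0), (19.8, 0.1)]
    centres = [(0, 0), (10, 10), (20, 0)]
    labels, computed = assign_with_pruning(points, centres)
    print(labels, computed, "of", len(points) * len(centres))
```

The pruned assignment returns the same labels as the naive one while computing fewer point-to-centre distances on well-separated data. The optimisations described in the abstract go further by keeping distance bounds from one iteration to the next, which is the state that plain MapReduce does not preserve between jobs.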
Pages: 985-992
Number of pages: 8