Efficient Parallel K-Means on MapReduce Using Triangle Inequality

Cited by: 2
Authors
Al Ghamdi, Sami [1]
Di Fatta, Giuseppe [1]
Affiliations
[1] Univ Reading, Dept Comp Sci, Reading RG6 6AY, Berks, England
Keywords
DOI
10.1109/DASC-PICom-DataCom-CyberSciTec.2017.163
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
K-Means, one of the most efficient and popular clustering algorithms, has been in use for more than 50 years. The naive implementation of K-Means spends the vast majority of its time on redundant distance calculations from each point to all cluster centres. This issue has been studied extensively, and methods based on the triangle inequality have been used to eliminate unnecessary distance calculations. Most triangle inequality optimisations cache extra information (distance bounds and cluster assignments) from one iteration to avoid computing exact distances in the next. This work takes these optimisations one step further and integrates them into an accelerated version of K-Means on MapReduce, a well-known distributed computing framework, to produce an efficient and highly scalable K-Means for big data. Although MapReduce is considered one of the most reliable and fault-tolerant distributed computing frameworks, one of its major drawbacks is that it does not natively support iterative algorithms such as K-Means: it does not cache any data between two consecutive iterations, which most triangle inequality optimisations require. Therefore, this work introduces two new approaches for passing information from one iteration to the next in order to accelerate K-Means. The first approach is called K-Means on MapReduce using Extended Vector (KMMR-EV). The second is called K-Means on MapReduce using Bounds Files (KMMR-BF). These approaches achieve speedups of up to 4.5x for KMMR-EV and 6.8x for KMMR-BF with respect to the naive implementation of K-Means on MapReduce (KMMR-N). Extensive experiments with real and synthetic datasets were conducted on Apache Hadoop (an open-source implementation of MapReduce), along with an overhead analysis, to show the effectiveness of both approaches.
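The pruning idea mentioned in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' KMMR-EV or KMMR-BF implementation and does not involve MapReduce or bounds cached across iterations; it only shows the basic triangle inequality test within one assignment step. If a point p is currently closest to centre c, and another centre c' satisfies d(c, c') >= 2 d(p, c), then d(p, c') >= d(p, c), so the distance to c' need not be computed. The function names `dist`, `assign_naive` and `assign_pruned` are hypothetical and chosen for this illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def dist(a: Point, b: Point) -> float:
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_naive(points: List[Point], centres: List[Point]) -> Tuple[List[int], int]:
    """Assign each point to its nearest centre by computing every distance.

    Returns the labels and the number of point-to-centre distances computed.
    """
    labels = []
    for p in points:
        ds = [dist(p, c) for c in centres]
        labels.append(ds.index(min(ds)))
    return labels, len(points) * len(centres)


def assign_pruned(points: List[Point], centres: List[Point]) -> Tuple[List[int], int]:
    """Same assignment, skipping centres ruled out by the triangle inequality.

    Only point-to-centre distances are counted; the k*k centre-to-centre
    distances computed once up front are not included in the count.
    """
    k = len(centres)
    cc = [[dist(centres[i], centres[j]) for j in range(k)] for i in range(k)]
    labels = []
    computed = 0
    for p in points:
        best, best_d = 0, dist(p, centres[0])
        computed += 1
        for j in range(1, k):
            # d(p, c_j) >= d(c_best, c_j) - d(p, c_best); if the centres are
            # at least 2 * best_d apart, c_j cannot be strictly closer.
            if cc[best][j] >= 2 * best_d:
                continue
            d = dist(p, centres[j])
            computed += 1
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, computed


if __name__ == "__main__":
    centres = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
    points = [(0.5, 0.2), (9.5, 10.1), (19.8, 0.3), (0.1, -0.4), (10.2, 9.7)]
    print(assign_naive(points, centres))
    print(assign_pruned(points, centres))
```

Both functions return the same labels; the pruned version simply reports fewer distance computations when clusters are well separated. The approaches described in the abstract go further by carrying bounds and assignments from one iteration to the next, which this sketch does not attempt.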
Pages: 985-992
Number of pages: 8
Related papers
50 in total
  • [31] An extended K-Means algorithm using mapreduce framework for mixed datasets
    Chadha, Anupama
    Kumar, Suresh
    [J]. International Journal of Database Theory and Application, 2016, 9 (09): : 167 - 176
  • [32] A Fast Exact k-Nearest Neighbors Algorithm for High Dimensional Search Using k-Means Clustering and Triangle Inequality
    Wang, Xueyi
    [J]. 2011 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2011, : 1293 - 1299
  • [33] Parallel BVH construction using k-means clustering
    Daniel Meister
    Jiří Bittner
    [J]. The Visual Computer, 2016, 32 : 977 - 987
  • [34] Parallel BVH construction using k-means clustering
    Meister, Daniel
    Bittner, Jiri
    [J]. VISUAL COMPUTER, 2016, 32 (6-8): : 977 - 987
  • [35] An Improved approach for K-Means using Parallel Processing
    Swamy, Prateek
    Raghuwanshi, M. M.
    Gholghate, Ashish
    [J]. 1ST INTERNATIONAL CONFERENCE ON COMPUTING COMMUNICATION CONTROL AND AUTOMATION ICCUBEA 2015, 2015, : 358 - 361
  • [36] Parallel Processing Of Enhanced K-Means Using OpenMP
    Naik, D. S. Bhupal
    Kumar, S. Deva
    Ramakrishna, S. V.
    [J]. 2013 IEEE INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND COMPUTING RESEARCH (ICCIC), 2013, : 685 - 688
  • [37] A MapReduce framework to implement Enhanced K-means algorithm
    Purohit, Bhimasen. V.
    Shettar, Rajashree
    [J]. PROCEEDINGS OF THE 2015 INTERNATIONAL CONFERENCE ON APPLIED AND THEORETICAL COMPUTING AND COMMUNICATION TECHNOLOGY (ICATCCT), 2015, : 361 - 363
  • [38] Parallel implementation of the K-Means algorithm on MapReduce
    蒋溢
    刘鑫洋
    [J]. Journal of Southwest University (Natural Science Edition), 2016, (11) : 180 - 185
  • [39] An Improved K-means Algorithm based on Mapreduce and Grid
    Ma, Li
    Gu, Lei
    Li, Bo
    Ma, Yue
    Wang, Jin
    [J]. INTERNATIONAL JOURNAL OF GRID AND DISTRIBUTED COMPUTING, 2015, 8 (01): : 189 - 199
  • [40] K-means Clustering Optimization Algorithm Based on MapReduce
    Li, Zhihua
    Song, Xudong
    Zhu, Wenhui
    Chen, Yanxia
    [J]. PROCEEDINGS OF THE 2015 INTERNATIONAL SYMPOSIUM ON COMPUTERS & INFORMATICS, 2015, 13 : 198 - 203