Efficient Parallel K-Means on MapReduce Using Triangle Inequality

Cited by: 2

Authors:

Al Ghamdi, Sami [1]

Di Fatta, Giuseppe [1]

Affiliation:

[1] Univ Reading, Dept Comp Sci, Reading RG6 6AY, Berks, England

Source:

2017 IEEE 15TH INTL CONF ON DEPENDABLE, AUTONOMIC AND SECURE COMPUTING, 15TH INTL CONF ON PERVASIVE INTELLIGENCE AND COMPUTING, 3RD INTL CONF ON BIG DATA INTELLIGENCE AND COMPUTING AND CYBER SCIENCE AND TECHNOLOGY CONGRESS(DASC/PICOM/DATACOM/CYBERSCI | 2017

DOI:

10.1109/DASC-PICom-DataCom-CyberSciTec.2017.163

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Subject classification code:

0812

Abstract:

K-Means is one of the most efficient and popular clustering algorithms and has been in use for more than 50 years. The naive implementation of K-Means spends the vast majority of its time on redundant distance calculations from each point to all cluster centres. This issue has been extensively studied, and methods based on the triangle inequality have been used to eliminate unnecessary distance calculations. Most triangle inequality optimisations cache extra information (distance bounds and cluster assignments) from one iteration to avoid computing exact distances in the next. This work takes these optimisations one step further and integrates them into an accelerated version of K-Means on MapReduce, a well-known distributed computing framework, to produce an efficient and highly scalable K-Means for big data. Although MapReduce is considered one of the most reliable and fault-tolerant distributed computing frameworks, one of its major drawbacks is that it does not support iterative algorithms such as K-Means and does not cache any data between two consecutive iterations, which most triangle inequality optimisations require. Therefore, this work introduces two new approaches for passing information from one iteration to the next to accelerate K-Means. The first is K-Means on MapReduce using Extended Vector (KMMR-EV); the second is K-Means on MapReduce using Bounds Files (KMMR-BF). These approaches achieve speedups of up to 4.5x for KMMR-EV and 6.8x for KMMR-BF with respect to the naive implementation of K-Means on MapReduce (KMMR-N). Extensive experiments with real and synthetic datasets have been conducted on Apache Hadoop (an open-source implementation of MapReduce), along with an overhead analysis, to show the effectiveness of both approaches.
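
The caching idea described in the abstract can be illustrated with a minimal single-machine sketch in Python. It is not the KMMR-EV or KMMR-BF implementation, whose details this record does not give. It assumes a simplified pruning test (one upper bound per point, compared with half the distance from its centre to the nearest other centre), and every name in it (`assign_with_pruning`, `update_centres`, `kmeans`, the `(point, assigned, upper_bound)` record layout) is hypothetical. Each record carries its assignment and bound from one iteration to the next, loosely in the spirit of attaching extra fields to each vector.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_with_pruning(records, centres, shifts):
    """One assignment pass over (point, assigned, upper_bound) records.

    shifts[j] is how far centre j moved since the bounds were last valid.
    Assumes at least two centres. Returns the new records and the number
    of point-to-centre distances that were actually computed.
    """
    k = len(centres)
    # Half the distance from each centre to its nearest other centre.
    half_gap = [
        0.5 * min(dist(centres[i], centres[j]) for j in range(k) if j != i)
        for i in range(k)
    ]
    new_records, computed = [], 0
    for point, assigned, upper in records:
        if assigned is not None:
            # The old bound stays valid once the centre's movement is added.
            upper += shifts[assigned]
            # Triangle inequality: no other centre can be closer, so skip.
            if upper <= half_gap[assigned]:
                new_records.append((point, assigned, upper))
                continue
        dists = [dist(point, c) for c in centres]
        computed += k
        assigned = min(range(k), key=dists.__getitem__)
        new_records.append((point, assigned, dists[assigned]))
    return new_records, computed


def update_centres(records, centres):
    """Recompute centres as cluster means; also return each centre's shift."""
    k, dim = len(centres), len(centres[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point, assigned, _ in records:
        counts[assigned] += 1
        for d, x in enumerate(point):
            sums[assigned][d] += x
    new_centres = [
        tuple(s / counts[i] for s in sums[i]) if counts[i] else centres[i]
        for i in range(k)
    ]
    shifts = [dist(old, new) for old, new in zip(centres, new_centres)]
    return new_centres, shifts


def kmeans(points, centres, iterations):
    """Run a fixed number of iterations, carrying bounds between them."""
    records = [(p, None, 0.0) for p in points]  # no assignment yet
    shifts = [0.0] * len(centres)
    computed_per_iteration = []
    for _ in range(iterations):
        records, computed = assign_with_pruning(records, centres, shifts)
        computed_per_iteration.append(computed)
        centres, shifts = update_centres(records, centres)
    return [a for _, a, _ in records], centres, computed_per_iteration
```

In this sketch the first pass computes every distance, because no bounds exist yet. Later passes skip any point whose carried bound already proves that its assignment cannot change. In a MapReduce setting the records would have to be written out and read back between jobs, which is the overhead the abstract refers to.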

Pages: 985-992

Page count: 8

50 items in total

[1] Analyzing Digital Evidence Using Parallel k-means with Triangle Inequality on Spark
Chitrakar, Ambika Shrestha
Petrovic, Slobodan
[J]. 2018 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2018, : 3049 - 3058
[2] Efficient k-means Using Triangle Inequality on Spark for Cyber Security Analytics
Chitrakar, Ambika Shrestha
Petrovic, Slobodan
[J]. PROCEEDINGS OF THE ACM INTERNATIONAL WORKSHOP ON SECURITY AND PRIVACY ANALYTICS (IWSPA '19), 2019, : 37 - 45
[3] Parallel Implementation of K-Means Algorithm Using MapReduce Approach
Borlea, Ioan-Daniel
Precup, Radu-Emil
Dragan, Florin
Borlea, Alexandra-Bianca
[J]. 2018 IEEE 12TH INTERNATIONAL SYMPOSIUM ON APPLIED COMPUTATIONAL INTELLIGENCE AND INFORMATICS (SACI), 2018, : 75 - 80
[4] Optimisation Techniques for Parallel K-Means on MapReduce
Al Ghamdi, Sami
Di Fatta, Giuseppe
Stahl, Frederic
[J]. INTERNET AND DISTRIBUTED COMPUTING SYSTEMS, IDCS 2015, 2015, 9258 : 193 - 200
[5] Parallel K-Means Clustering Based on MapReduce
Zhao, Weizhong
Ma, Huifang
He, Qing
[J]. CLOUD COMPUTING, PROCEEDINGS, 2009, 5931 : 674 - 679
[6] An Efficient K-means Clustering Algorithm on MapReduce
Li, Qiuhong
Wang, Peng
Wang, Wei
Hu, Hao
Li, Zhongsheng
Li, Junxian
[J]. DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, DASFAA 2014, PT I, 2014, 8421 : 357 - 371
[7] An Improved parallel K-means Clustering Algorithm with MapReduce
Liao, Qing
Yang, Fan
Zhao, Jingming
[J]. 2013 15TH IEEE INTERNATIONAL CONFERENCE ON COMMUNICATION TECHNOLOGY (ICCT), 2013, : 764 - 768
[8] An Effective and Efficient Clustering Based on K-Means Using MapReduce and TLBO
Pedireddla, Praveen Kumar
Yadwad, Sunita A.
[J]. PROCEEDINGS OF THE SECOND INTERNATIONAL CONFERENCE ON COMPUTER AND COMMUNICATION TECHNOLOGIES, IC3T 2015, VOL 3, 2016, 381 : 619 - 628
[9] KPynq: A Work-Efficient Triangle-Inequality based K-means on FPGA
Wang, Yuke
Zeng, Zhaorui
Feng, Boyuan
Deng, Lei
Ding, Yufei
[J]. 2019 27TH IEEE ANNUAL INTERNATIONAL SYMPOSIUM ON FIELD-PROGRAMMABLE CUSTOM COMPUTING MACHINES (FCCM), 2019, : 320 - 320
[10] Efficient k-Means++ Approximation with MapReduce
Xu, Yujie
Qu, Wenyu
Li, Zhiyang
Min, Geyong
Li, Keqiu
Liu, Zhaobin
[J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2014, 25 (12) : 3135 - 3144