A hybrid MapReduce-based k-means clustering using genetic algorithm for distributed datasets

Cited by: 0

Authors:

Ankita Sinha

Prasanta K. Jana

Affiliations:

[1] Department of Computer Science and Engineering, IIT (ISM), Dhanbad

Source:

The Journal of Supercomputing | 2018 / Vol. 74

Keywords:

Mahalanobis distance; Apache Hadoop; K-means++ initialization; Genetic algorithm

DOI:

Not available


Abstract:

Clustering a large volume of data in a distributed environment is a challenging problem. Data stored across multiple machines are huge in size, and the solution space is large. Genetic algorithms deal effectively with large solution spaces and provide better solutions. In this paper, we propose a novel clustering algorithm for distributed datasets that combines a genetic algorithm (GA) using Mahalanobis distance with the k-means clustering algorithm. The proposed algorithm has two phases. In phase 1, GA is applied in parallel to data chunks located on different machines. Mahalanobis distance is used as the fitness value in GA; it takes the covariance between the data points into account and thus provides a better representation of the initial data. In phase 2, K-means with K-means++ initialization is applied to the intermediate output to obtain the final result. The proposed algorithm is implemented on the Hadoop framework, which is inherently designed to handle distributed datasets in a fault-tolerant manner. Extensive experiments were conducted on multiple real-life and synthetic datasets to measure the performance of the proposed algorithm. Results were compared with the MapReduce-based algorithms mrk-means, parallel k-means and scaling GA.
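The K-means++ initialization named for phase 2 can be illustrated with a minimal single-machine sketch. This is the standard K-means++ seeding rule with squared Euclidean distance, not the authors' Hadoop implementation; the function names `squared_distance` and `kmeans_pp_init` and the `rng` parameter are hypothetical and chosen only for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, rng=None):
    """Pick k initial centers using the standard K-means++ seeding rule.

    Assumption: points is a list of equal-length numeric tuples held in
    memory; a distributed version would work on per-machine chunks.
    """
    rng = rng or random.Random()
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    # First center: chosen uniformly at random.
    centers = [rng.choice(points)]
    # d2[i] holds the squared distance from points[i] to its nearest center.
    d2 = [squared_distance(p, centers[0]) for p in points]

    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            nxt = rng.choice(points)
        else:
            # Sample the next center with probability proportional to d2.
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        # Refresh nearest-center distances against the new center only.
        d2 = [min(d, squared_distance(p, nxt)) for d, p in zip(d2, points)]
    return centers


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 0.1), (10.0, 10.0), (10.0, 10.1)]
    print(kmeans_pp_init(data, 2, random.Random(0)))
```

Because points far from the existing centers receive more sampling weight, the seeds tend to spread across the data, which is the usual motivation for preferring this scheme over uniform random seeding.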


Pages: 1562-1579

Number of pages: 17

50 records in total

[1] A hybrid MapReduce-based k-means clustering using genetic algorithm for distributed datasets
Sinha, Ankita
Jana, Prasanta K.
JOURNAL OF SUPERCOMPUTING, 2018, 74 (04): : 1562 - 1579
[2] A MapReduce-based K-means clustering algorithm
YiMin Mao
DeJin Gan
D. S. Mwakapesa
Y. A. Nanehkaran
Tao Tao
XueYu Huang
The Journal of Supercomputing, 2022, 78 : 5181 - 5202
[3] A MapReduce-based K-means clustering algorithm
Mao, YiMin
Gan, DeJin
Mwakapesa, D. S.
Nanehkaran, Y. A.
Tao, Tao
Huang, XueYu
JOURNAL OF SUPERCOMPUTING, 2022, 78 (04): : 5181 - 5202
[4] An Analysis of Distributed Document Clustering Using MapReduce Based K-Means Algorithm
Sardar T.H.
Ansari Z.
Ansari, Zahid (zahid_cs@pace.edu.in), 1600, Springer (101): : 641 - 650
[5] Data Categorization Using Hadoop MapReduce-Based Parallel K-Means Clustering
Ansari Z.
Afzal A.
Sardar T.H.
Journal of The Institution of Engineers (India): Series B, 2019, 100 (02) : 95 - 103
[6] An Efficient MapReduce-based Adaptive K-Means Clustering for Large Dataset
Chowdhury, Tapan
Mukherjee, Arijit
Chakraborty, Susanta
2017 3RD IEEE INTERNATIONAL SYMPOSIUM ON NANOELECTRONIC AND INFORMATION SYSTEMS (INIS), 2017, : 157 - 162
[7] Distributed, MapReduce-based Nearest Neighbor and ε-ball Kernel k-Means
Tsapanos, Nikolaos
Tefas, Anastasios
Nikolaidis, Nikolaos
Pitas, Ioannis
2015 IEEE SYMPOSIUM SERIES ON COMPUTATIONAL INTELLIGENCE (IEEE SSCI), 2015, : 509 - 515
[8] K-means Clustering Optimization Algorithm Based on MapReduce
Li, Zhihua
Song, Xudong
Zhu, Wenhui
Chen, Yanxia
PROCEEDINGS OF THE 2015 INTERNATIONAL SYMPOSIUM ON COMPUTERS & INFORMATICS, 2015, 13 : 198 - 203
[9] MapReduce-based distributed tensor clustering algorithm
Zhang, Hongjun
Li, Peng
Meng, Fanshuo
Fan, Weibei
Xue, Zhuangzhuang
NEURAL COMPUTING & APPLICATIONS, 2023, 35 (35): : 24633 - 24649
[10] MapReduce-based distributed tensor clustering algorithm
Hongjun Zhang
Peng Li
Fanshuo Meng
Weibei Fan
Zhuangzhuang Xue
Neural Computing and Applications, 2023, 35 : 24633 - 24649
