A hybrid MapReduce-based k-means clustering using genetic algorithm for distributed datasets

Authors
Ankita Sinha
Prasanta K. Jana
Affiliations
[1] Department of Computer Science and Engineering, IIT (ISM), Dhanbad
Keywords
Mahalanobis distance; Apache Hadoop; k-means++ initialization; Genetic algorithm
Abstract
Clustering a large volume of data in a distributed environment is a challenging problem. The data stored across multiple machines are huge, and the solution space is large. Genetic algorithms deal effectively with large solution spaces and provide better solutions. In this paper, we propose a novel clustering algorithm for distributed datasets that combines a genetic algorithm (GA) using Mahalanobis distance with the k-means clustering algorithm. The proposed algorithm has two phases. In phase 1, GA is applied in parallel to data chunks located on different machines. Mahalanobis distance is used as the fitness value in GA; it considers the covariance between data points and thus provides a better representation of the initial data. In phase 2, k-means with k-means++ initialization is applied to the intermediate output to obtain the final result. The proposed algorithm is implemented on the Hadoop framework, which is inherently designed to deal with distributed datasets in a fault-tolerant manner. Extensive experiments were conducted on multiple real-life and synthetic datasets to measure the performance of the proposed algorithm. Results were compared with the MapReduce-based algorithms mrk-means, parallel k-means and scaling GA.
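The two building blocks named in the abstract, Mahalanobis distance and k-means++ initialization, can be illustrated with a minimal single-machine sketch. This is not the authors' implementation and involves neither Hadoop nor the GA itself. The function names are hypothetical, the Mahalanobis helper is restricted to 2-D points so that the covariance matrix can be inverted in closed form, and the seeding step assumes the standard k-means++ rule based on squared Euclidean distance.

```python
import random


def mahalanobis_sq_2d(p, mean, cov):
    """Squared Mahalanobis distance of a 2-D point p from mean.

    cov is a 2x2 covariance matrix ((a, b), (c, d)). Hypothetical helper,
    limited to 2-D so the inverse can be written out by hand.
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    # Inverse of [[a, b], [c, d]] is (1 / det) * [[d, -b], [-c, a]].
    return (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det


def sq_euclidean(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeanspp_seeds(points, k, rng):
    """Choose k initial centers with the standard k-means++ rule.

    The first center is drawn uniformly; each later center is drawn with
    probability proportional to its squared distance to the nearest
    center chosen so far.
    """
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_euclidean(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            # Every point coincides with an existing center.
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 0.1), (10.0, 10.0), (10.0, 10.1)]
    print(kmeanspp_seeds(data, 2, random.Random(0)))
    print(mahalanobis_sq_2d((2.0, 0.0), (0.0, 0.0), ((4.0, 0.0), (0.0, 1.0))))
```

With an identity covariance matrix the Mahalanobis distance reduces to the Euclidean distance; a larger variance along one axis shrinks distances measured along that axis.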
Pages: 1562-1579
Page count: 17
Related papers
  • [1] A hybrid MapReduce-based k-means clustering using genetic algorithm for distributed datasets
    Sinha, Ankita
    Jana, Prasanta K.
    JOURNAL OF SUPERCOMPUTING, 2018, 74 (04): : 1562 - 1579
  • [2] A MapReduce-based K-means clustering algorithm
    YiMin Mao
    DeJin Gan
    D. S. Mwakapesa
    Y. A. Nanehkaran
    Tao Tao
    XueYu Huang
    The Journal of Supercomputing, 2022, 78 : 5181 - 5202
  • [3] A MapReduce-based K-means clustering algorithm
    Mao, YiMin
    Gan, DeJin
    Mwakapesa, D. S.
    Nanehkaran, Y. A.
    Tao, Tao
    Huang, XueYu
    JOURNAL OF SUPERCOMPUTING, 2022, 78 (04): : 5181 - 5202
  • [4] An Analysis of Distributed Document Clustering Using MapReduce Based K-Means Algorithm
    Sardar T.H.
    Ansari Z.
    Ansari, Zahid (zahid_cs@pace.edu.in), 1600, Springer (101): : 641 - 650
  • [5] Data Categorization Using Hadoop MapReduce-Based Parallel K-Means Clustering
    Ansari Z.
    Afzal A.
    Sardar T.H.
    Journal of The Institution of Engineers (India): Series B, 2019, 100 (02) : 95 - 103
  • [6] An Efficient MapReduce-based Adaptive K-Means Clustering for Large Dataset
    Chowdhury, Tapan
    Mukherjee, Arijit
    Chakraborty, Susanta
    2017 3RD IEEE INTERNATIONAL SYMPOSIUM ON NANOELECTRONIC AND INFORMATION SYSTEMS (INIS), 2017, : 157 - 162
  • [7] Distributed, MapReduce-based Nearest Neighbor and ε-ball Kernel k-Means
    Tsapanos, Nikolaos
    Tefas, Anastasios
    Nikolaidis, Nikolaos
    Pitas, Ioannis
    2015 IEEE SYMPOSIUM SERIES ON COMPUTATIONAL INTELLIGENCE (IEEE SSCI), 2015, : 509 - 515
  • [8] K-means Clustering Optimization Algorithm Based on MapReduce
    Li, Zhihua
    Song, Xudong
    Zhu, Wenhui
    Chen, Yanxia
    PROCEEDINGS OF THE 2015 INTERNATIONAL SYMPOSIUM ON COMPUTERS & INFORMATICS, 2015, 13 : 198 - 203
  • [9] MapReduce-based distributed tensor clustering algorithm
    Zhang, Hongjun
    Li, Peng
    Meng, Fanshuo
    Fan, Weibei
    Xue, Zhuangzhuang
    NEURAL COMPUTING & APPLICATIONS, 2023, 35 (35): : 24633 - 24649
  • [10] MapReduce-based distributed tensor clustering algorithm
    Hongjun Zhang
    Peng Li
    Fanshuo Meng
    Weibei Fan
    Zhuangzhuang Xue
    Neural Computing and Applications, 2023, 35 : 24633 - 24649