A MapReduce-based K-means clustering algorithm

被引:4
|
作者
Mao, YiMin [1 ]
Gan, DeJin [1 ]
Mwakapesa, D. S. [1 ]
Nanehkaran, Y. A. [1 ]
Tao, Tao [1 ]
Huang, XueYu [1 ]
机构
[1] Jiangxi Univ Sci & Technol, Sch Informat Engn, Ganzhou 341000, Jiangxi, Peoples R China
来源
JOURNAL OF SUPERCOMPUTING | 2022年 / 78卷 / 04期
基金
中国国家自然科学基金;
关键词
K-means; Big data; LSH; MapReduce; Grid density; X-ARCHITECTURE; BIG; EFFICIENT; MODEL;
D O I
10.1007/s11227-021-04078-8
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
The partitioning-based k-means clustering is one of the most important clustering algorithms. However, in big data environment, it faces the problems of random selection of initial cluster centers randomly, expensive communication overhead among MapReduce nodes and data skewing in data partitions, and others. To solve these problems, this paper proposes a parallel clustering algorithm based on grid density and local sensitive hash function (MR-PGDLSH) which takes into account the advantages of MapReduce and LSH (locality sensitive hash function). In the MR-PGDLSH, firstly the GDS (grid density strategy) is designed to obtain the relatively reasonable initial cluster centers. Then, a DP-LSH (data partition based on locality sensitive hash function) is proposed to divide the data set into multiple segments. The relevant data objects are mapped to the same sub-data set. The similarity function is designed to generate clusters, thereby reducing frequent communication overhead between nodes. Next, the AGS (adaptive grouping strategy) is applied to distribute the amount of data on each node evenly, which solves the problem of data skew on the node. Finally, the MR-PGDLSH is applied to mine the cluster centers in parallel, which obtains the final clustering results. Both theoretical analysis and experimental results have shown that the MR-PGDLSH is superior to the existing clustering algorithms.
引用
收藏
页码:5181 / 5202
页数:22
相关论文
共 50 条
  • [1] A MapReduce-based K-means clustering algorithm
    YiMin Mao
    DeJin Gan
    D. S. Mwakapesa
    Y. A. Nanehkaran
    Tao Tao
    XueYu Huang
    [J]. The Journal of Supercomputing, 2022, 78 : 5181 - 5202
  • [2] A hybrid MapReduce-based k-means clustering using genetic algorithm for distributed datasets
    Ankita Sinha
    Prasanta K. Jana
    [J]. The Journal of Supercomputing, 2018, 74 : 1562 - 1579
  • [3] A hybrid MapReduce-based k-means clustering using genetic algorithm for distributed datasets
    Sinha, Ankita
    Jana, Prasanta K.
    [J]. JOURNAL OF SUPERCOMPUTING, 2018, 74 (04): : 1562 - 1579
  • [4] An Efficient MapReduce-based Adaptive K-Means Clustering for Large Dataset
    Chowdhury, Tapan
    Mukherjee, Arijit
    Chakraborty, Susanta
    [J]. 2017 3RD IEEE INTERNATIONAL SYMPOSIUM ON NANOELECTRONIC AND INFORMATION SYSTEMS (INIS), 2017, : 157 - 162
  • [5] K-means Clustering Optimization Algorithm Based on MapReduce
    Li, Zhihua
    Song, Xudong
    Zhu, Wenhui
    Chen, Yanxia
    [J]. PROCEEDINGS OF THE 2015 INTERNATIONAL SYMPOSIUM ON COMPUTERS & INFORMATICS, 2015, 13 : 198 - 203
  • [6] Data Categorization Using Hadoop MapReduce-Based Parallel K-Means Clustering
    Ansari Z.
    Afzal A.
    Sardar T.H.
    [J]. Journal of The Institution of Engineers (India): Series B, 2019, 100 (2) : 95 - 103
  • [7] MapReduce Design of K-Means Clustering Algorithm
    Anchalia, Prajesh P.
    Koundinya, Anjan K.
    Srinath, N. K.
    [J]. 2013 INTERNATIONAL CONFERENCE ON INFORMATION SCIENCE AND APPLICATIONS (ICISA 2013), 2013,
  • [8] An Efficient K-means Clustering Algorithm on MapReduce
    Li, Qiuhong
    Wang, Peng
    Wang, Wei
    Hu, Hao
    Li, Zhongsheng
    Li, Junxian
    [J]. DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, DASFAA 2014, PT I, 2014, 8421 : 357 - 371
  • [9] An Improved Sampling K-means Clustering Algorithm Based on MapReduce
    Zhang Ya-ling
    Wang Ya-nan
    [J]. 2017 13TH INTERNATIONAL CONFERENCE ON NATURAL COMPUTATION, FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY (ICNC-FSKD), 2017,
  • [10] An Improved parallel K-means Clustering Algorithm with MapReduce
    Liao, Qing
    Yang, Fan
    Zhao, Jingming
    [J]. 2013 15TH IEEE INTERNATIONAL CONFERENCE ON COMMUNICATION TECHNOLOGY (ICCT), 2013, : 764 - 768