A MapReduce-based K-means clustering algorithm

Cited by: 4
Authors
Mao, YiMin [1]
Gan, DeJin [1]
Mwakapesa, D. S. [1]
Nanehkaran, Y. A. [1]
Tao, Tao [1]
Huang, XueYu [1]
Affiliations
[1] Jiangxi Univ Sci & Technol, Sch Informat Engn, Ganzhou 341000, Jiangxi, Peoples R China
Source
JOURNAL OF SUPERCOMPUTING | 2022 / Vol. 78 / Issue 04
Funding
National Natural Science Foundation of China;
Keywords
K-means; Big data; LSH; MapReduce; Grid density; X-ARCHITECTURE; BIG; EFFICIENT; MODEL;
DOI
10.1007/s11227-021-04078-8
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Partitioning-based k-means is one of the most important clustering algorithms. In a big data environment, however, it suffers from several problems, including random selection of the initial cluster centers, expensive communication overhead among MapReduce nodes, and data skew across data partitions. To solve these problems, this paper proposes MR-PGDLSH, a parallel clustering algorithm based on grid density and a locality sensitive hash (LSH) function, which combines the advantages of MapReduce and LSH. In MR-PGDLSH, a grid density strategy (GDS) is first designed to obtain relatively reasonable initial cluster centers. Then, a data partition method based on the locality sensitive hash function (DP-LSH) is proposed to divide the data set into multiple segments, so that related data objects are mapped to the same sub-data set. A similarity function is designed to generate clusters, thereby reducing frequent communication between nodes. Next, an adaptive grouping strategy (AGS) is applied to distribute the data evenly across nodes, which addresses the data skew problem. Finally, MR-PGDLSH mines the cluster centers in parallel to obtain the final clustering results. Both theoretical analysis and experimental results show that MR-PGDLSH is superior to the existing clustering algorithms.
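For orientation, the generic map/combine/reduce shape of a single k-means iteration can be sketched as below. This is a minimal, single-process illustration of the standard MapReduce formulation of k-means, not the MR-PGDLSH algorithm itself: the grid density seeding (GDS), the LSH-based partitioning (DP-LSH) and the adaptive grouping (AGS) described in the abstract are not implemented. The function names (map_phase, combine, reduce_phase, kmeans_iteration) and the sample data are hypothetical, and each inner list in `partitions` simply stands in for the data held by one node.

```python
from math import dist


def map_phase(partition, centers):
    """Mapper: emit (nearest-center index, (point, 1)) for each point in one partition."""
    for point in partition:
        nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
        yield nearest, (point, 1)


def combine(pairs):
    """Combiner: pre-aggregate coordinate sums and counts per center on one node."""
    partial = {}
    for idx, (point, count) in pairs:
        if idx in partial:
            sums, n = partial[idx]
            partial[idx] = ([s + x for s, x in zip(sums, point)], n + count)
        else:
            partial[idx] = (list(point), count)
    return partial


def reduce_phase(partials, centers):
    """Reducer: merge the partial sums from all nodes and compute new centers."""
    totals = {}
    for partial in partials:
        for idx, (sums, n) in partial.items():
            if idx in totals:
                t_sums, t_n = totals[idx]
                totals[idx] = ([a + b for a, b in zip(t_sums, sums)], t_n + n)
            else:
                totals[idx] = (list(sums), n)
    # A center that received no points keeps its previous position.
    return [
        tuple(s / totals[i][1] for s in totals[i][0]) if i in totals else tuple(centers[i])
        for i in range(len(centers))
    ]


def kmeans_iteration(partitions, centers):
    """One k-means iteration; each partition plays the role of one node's data."""
    partials = [combine(map_phase(p, centers)) for p in partitions]
    return reduce_phase(partials, centers)


# Hypothetical toy data: three "nodes" and two initial centers.
partitions = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)], [(1.0, 1.0)]]
centers = [(0.0, 0.0), (10.0, 10.0)]
new_centers = kmeans_iteration(partitions, centers)
```

The combiner is what keeps communication low in this formulation: each node ships one (sum, count) pair per center instead of every point. The abstract's strategies aim at the remaining issues, namely the quality of the initial centers and how evenly the partitions are sized.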
Pages: 5181 - 5202
Page count: 22