MR-BIRCH: A scalable MapReduce-based BIRCH clustering algorithm

Cited by: 6
Authors
Li, Yufeng [1]
Jiang, HaiTian [2]
Lu, Jiyong [2]
Li, Xiaozhong [1]
Sun, Zhiwei [1]
Li, Min [1]
Affiliations
[1] Tianjin Univ Sci & Technol, Coll Artificial Intelligence, Tianjin, Peoples R China
[2] Tianjin Univ Sci & Technol, Coll Sci, Tianjin, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Clustering; BIRCH; k-means; MapReduce; Hadoop;
DOI
10.3233/JIFS-202079
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Many classical clustering algorithms have been adapted to MapReduce, which offers a new way to cluster big data. However, most of these algorithms need several iterations to reach an acceptable result, and each iteration runs a new MapReduce job that reloads the dataset into main memory, causing high I/O overhead and poor efficiency. The BIRCH algorithm clusters big data by storing only statistical summaries of objects, in clustering feature (CF) entries organized in a CF tree. As the number of tree nodes grows, however, main memory becomes insufficient to hold more objects, so BIRCH has to shrink the tree, which degrades clustering quality and slows overall execution. To address this problem, this paper adapts BIRCH to MapReduce; the resulting algorithm is called MR-BIRCH. Unlike many MapReduce-based algorithms, MR-BIRCH loads the dataset only once and processes it in parallel on several machines. The complexity and scalability of MR-BIRCH were analyzed, and it was compared with Python sklearn BIRCH and Apache Mahout k-means on real-world and synthetic datasets. Experimental results show that, most of the time, MR-BIRCH was better than or equal to sklearn BIRCH and competitive with Mahout k-means.
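As a rough illustration of why CF entries suit a load-once, parallel design, the sketch below uses the standard BIRCH clustering feature, the triple (N, LS, SS) of point count, per-dimension linear sum, and sum of squared norms. It is not the MR-BIRCH implementation from the paper: the class `CF` and the helper `summarize` are hypothetical names chosen for this example, and the "partitions" are plain Python lists standing in for data splits on different machines.

```python
from dataclasses import dataclass
from functools import reduce
import math


@dataclass(frozen=True)
class CF:
    """Standard BIRCH clustering feature (N, LS, SS). Hypothetical helper class."""
    n: int        # number of points summarized
    ls: tuple     # linear sum of the points, per dimension
    ss: float     # sum of squared Euclidean norms of the points

    @classmethod
    def from_point(cls, p):
        return cls(1, tuple(p), sum(x * x for x in p))

    def merge(self, other):
        # CF additivity: the summary of a union is the component-wise sum.
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        # Root mean squared distance of the points to the centroid,
        # computed from the summary alone: sqrt(SS/N - ||centroid||^2).
        c = self.centroid()
        var = self.ss / self.n - sum(x * x for x in c)
        return math.sqrt(max(var, 0.0))


def summarize(points):
    """Reduce one partition of raw points to a single CF (assumed non-empty)."""
    return reduce(CF.merge, (CF.from_point(p) for p in points))


# Two partitions, summarized independently and then merged.
part_a = summarize([(0.0, 0.0), (2.0, 0.0)])
part_b = summarize([(0.0, 2.0), (2.0, 2.0)])
total = part_a.merge(part_b)
print(total.n, total.centroid(), total.radius())
```

Because `merge` only adds the three components, each partition can be summarized without seeing the others, and the merged summary equals the one obtained from all points at once. That property is what makes a single pass over the data plausible; how MR-BIRCH actually builds and combines CF trees is described in the paper itself.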
Pages: 5295-5305
Number of pages: 11