MR-BIRCH: A scalable MapReduce-based BIRCH clustering algorithm

Cited by: 6

Authors:

Li, Yufeng ^{[1]}

Jiang, HaiTian ^{[2]}

Lu, Jiyong ^{[2]}

Li, Xiaozhong ^{[1]}

Sun, Zhiwei ^{[1]}

Li, Min ^{[1]}

Affiliations:

[1] Tianjin Univ Sci & Technol, Coll Artificial Intelligence, Tianjin, Peoples R China

[2] Tianjin Univ Sci & Technol, Coll Sci, Tianjin, Peoples R China

Source:

JOURNAL OF INTELLIGENT & FUZZY SYSTEMS | 2021 / Vol. 40 / Issue 03

Funding:

National Natural Science Foundation of China;

Keywords:

Clustering; BIRCH; k-means; MapReduce; Hadoop;

DOI:

10.3233/JIFS-202079

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Many classical clustering algorithms have been adapted to MapReduce, which offers a new approach to clustering big data. However, most of these algorithms need several iterations to reach an acceptable result, and each iteration must run a new MapReduce job that loads the dataset into main memory again, resulting in high I/O overhead and poor efficiency. The BIRCH algorithm clusters big data by storing only statistical summaries of the objects, in the form of CF (clustering feature) entries organized in a CF tree. As the number of tree nodes grows, however, main memory becomes insufficient to hold more objects, so BIRCH has to shrink the tree, which degrades clustering quality and slows overall execution. To address this problem, this paper adapts BIRCH to MapReduce; the resulting algorithm is called MR-BIRCH. In contrast to many MapReduce-based algorithms, MR-BIRCH loads the dataset only once and processes it in parallel on several machines. The complexity and scalability of MR-BIRCH were analyzed, and MR-BIRCH was compared with Python sklearn BIRCH and Apache Mahout k-means on real-world and synthetic datasets. Experimental results show that, in most cases, MR-BIRCH performed as well as or better than sklearn BIRCH and was competitive with Mahout k-means.
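To illustrate the CF entries mentioned in the abstract, here is a minimal Python sketch. It is not the authors' MR-BIRCH implementation; it assumes the standard BIRCH clustering feature, a triple (N, LS, SS) of point count, linear sum, and sum of squared norms. The class name `CF`, the helper `summarize`, and the two sample partitions are hypothetical. The sketch shows why such summaries suit a single-pass, partitioned computation: CFs built independently on separate partitions can be merged by component-wise addition and give the same summary as processing all points together.

```python
from dataclasses import dataclass
from functools import reduce
from math import sqrt
from typing import Iterable, Sequence, Tuple


@dataclass(frozen=True)
class CF:
    """Clustering feature (assumed standard BIRCH form): N, LS, SS."""
    n: int                      # number of points summarized
    ls: Tuple[float, ...]       # per-dimension linear sum of the points
    ss: float                   # sum of squared norms of the points

    @staticmethod
    def of_point(p: Sequence[float]) -> "CF":
        # A single point is a CF with N = 1.
        return CF(1, tuple(float(x) for x in p), sum(float(x) * x for x in p))

    def merge(self, other: "CF") -> "CF":
        # CF additivity: merging two summaries is component-wise addition.
        ls = tuple(a + b for a, b in zip(self.ls, other.ls))
        return CF(self.n + other.n, ls, self.ss + other.ss)

    def centroid(self) -> Tuple[float, ...]:
        return tuple(x / self.n for x in self.ls)

    def radius(self) -> float:
        # Root-mean-square distance of the summarized points to the centroid.
        c = self.centroid()
        variance = self.ss / self.n - sum(x * x for x in c)
        return sqrt(max(variance, 0.0))  # guard against tiny negative rounding


def summarize(points: Iterable[Sequence[float]]) -> CF:
    """Fold a non-empty collection of points into one CF."""
    return reduce(CF.merge, (CF.of_point(p) for p in points))


# Hypothetical data split across two workers.
part_a = [(0.0, 0.0), (2.0, 0.0)]
part_b = [(0.0, 2.0), (2.0, 2.0)]

# Each partition is summarized on its own, then the summaries are combined.
merged = summarize(part_a).merge(summarize(part_b))
whole = summarize(part_a + part_b)
```

In this sketch `merged` and `whole` are the same summary, so the raw points never need to be revisited after the first pass. How MR-BIRCH actually assigns work to mappers and reducers is described in the paper, not here.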


Pages: 5295-5305

Page count: 11
