MR-BIRCH: A scalable MapReduce-based BIRCH clustering algorithm

Cited by: 6
Authors
Li, Yufeng [1]
Jiang, HaiTian [2]
Lu, Jiyong [2]
Li, Xiaozhong [1]
Sun, Zhiwei [1]
Li, Min [1]
Affiliations
[1] Tianjin Univ Sci & Technol, Coll Artificial Intelligence, Tianjin, Peoples R China
[2] Tianjin Univ Sci & Technol, Coll Sci, Tianjin, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Clustering; BIRCH; k-means; MapReduce; Hadoop;
DOI
10.3233/JIFS-202079
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Many classical clustering algorithms have been adapted to MapReduce, which offers a new approach to clustering big data. However, most of these algorithms need several iterations to reach an acceptable result. Each iteration must run a new MapReduce job that loads the dataset into main memory again, which causes high I/O overhead and poor efficiency. The BIRCH algorithm clusters big data by storing only statistical summaries of the objects, as CF entries organized in a CF tree. As the number of tree nodes grows, however, main memory becomes insufficient to hold more objects. BIRCH then has to shrink the tree, which degrades clustering quality and slows down overall execution. To address this problem, this paper adapts BIRCH to MapReduce; the resulting algorithm is called MR-BIRCH. In contrast to many MapReduce-based algorithms, MR-BIRCH loads the dataset only once and processes it in parallel on several machines. The complexity and scalability of MR-BIRCH were analyzed, and MR-BIRCH was compared with Python sklearn BIRCH and Apache Mahout k-means on real-world and synthetic datasets. Experimental results show that, most of the time, MR-BIRCH performed as well as or better than sklearn BIRCH and was competitive with Mahout k-means.
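The CF entries mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes the standard BIRCH definition of a clustering feature as the triple (N, LS, SS) and relies on the fact that CFs are additive, which is what allows summaries built independently on separate data splits to be combined later. The names `CF`, `map_partition`, and `reduce_partials` are hypothetical and chosen for illustration; this is not the MR-BIRCH implementation from the paper, and it collapses each split into a single CF instead of building a CF tree.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, List, Sequence


@dataclass
class CF:
    """Clustering feature: point count, linear sum, and sum of squared norms."""
    n: int
    ls: List[float]
    ss: float

    @classmethod
    def from_point(cls, p: Sequence[float]) -> "CF":
        return cls(1, list(p), sum(x * x for x in p))

    def merge(self, other: "CF") -> "CF":
        # CF additivity: the summary of a union is the component-wise sum.
        return CF(self.n + other.n,
                  [a + b for a, b in zip(self.ls, other.ls)],
                  self.ss + other.ss)

    def centroid(self) -> List[float]:
        return [x / self.n for x in self.ls]

    def radius(self) -> float:
        # Root-mean-square distance of the summarized points from the centroid.
        c2 = sum(x * x for x in self.centroid())
        return max(self.ss / self.n - c2, 0.0) ** 0.5


def map_partition(points: Iterable[Sequence[float]]) -> CF:
    """Hypothetical map step: summarize one data split as a single CF."""
    return reduce(CF.merge, (CF.from_point(p) for p in points))


def reduce_partials(partials: Iterable[CF]) -> CF:
    """Hypothetical reduce step: combine the per-split summaries."""
    return reduce(CF.merge, partials)


# Two splits, each read exactly once, then merged.
splits = [[(0.0, 0.0), (2.0, 0.0)], [(0.0, 2.0), (2.0, 2.0)]]
total = reduce_partials(map_partition(s) for s in splits)
```

Because only the triples are exchanged between the two steps, each raw point is read once, which mirrors the single-load property the abstract claims for MR-BIRCH.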
Pages: 5295-5305
Page count: 11