REMOLD: An Efficient Model-based Clustering Algorithm For Large Datasets with Spark

Cited by: 7
Authors
Liang, Mingfei [1]
Li, Qingyong [1]
Geng, Yangli-ao [1]
Wang, Jianzhu [1]
Wei, Zhi [2]
Affiliations
[1] Beijing Jiaotong Univ, Beijing Key Lab Transportat Data Anal & Min, Beijing, Peoples R China
[2] New Jersey Inst Technol, Coll Comp Sci, Dept Comp Sci, Newark, NJ 07102 USA
Funding
Beijing Natural Science Foundation;
Keywords
distributed clustering; density-based clustering; density estimation; Gaussian model; Spark;
DOI
10.1109/ICPADS.2017.00057
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812 ;
Abstract
Density-based clustering algorithms have the distinctive advantage of discovering arbitrarily shaped clusters, but they usually need to compute the distance between every pair of data points, which is prohibitive for large datasets because of its quadratic computational complexity. In this paper, we propose a new distributed clustering algorithm named REstore MOdel with Local Density estimation (REMOLD). First, REMOLD applies a balanced partitioning method based on Locality-Sensitive Hashing (LSH) to divide a large dataset evenly. Then, it clusters each partition locally and represents each local cluster with a Gaussian model, based on the observation that the density distribution of each local cluster has a shape similar to a Gaussian distribution. Finally, these models are aggregated on a server, where REMOLD restores the global clusters from the local Gaussian models. More specifically, a model connection, which measures the density connectivity between two models, is defined and used to merge local models with an optimized procedure. This aggregation incurs a low network transmission cost, since the number of Gaussian models in each partition is often smaller than the number of core objects. We evaluate REMOLD on three synthetic datasets and three real-world datasets on Spark, and the experimental results demonstrate that REMOLD finds clusters with complex shapes efficiently and effectively and that it outperforms the established methods.
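The three stages named in the abstract (balanced partitioning, local Gaussian models, model merging) can be illustrated with a minimal single-machine sketch. It is not the authors' implementation and does not use Spark. Every function name and parameter below is hypothetical, and each stage uses a simple stand-in: a sort along one random projection instead of the paper's LSH scheme, eps-neighbourhood connected components instead of its local density-based clustering, axis-aligned Gaussians, and an assumed "means within a few standard deviations" rule in place of the paper's model connection, which the abstract does not define.

```python
import math
import random
from statistics import fmean, pstdev


def balanced_partitions(points, k, seed=0):
    """Stand-in for LSH partitioning: sort by one random projection, cut into equal slices."""
    rng = random.Random(seed)
    direction = [rng.gauss(0, 1) for _ in points[0]]
    ordered = sorted(points, key=lambda p: sum(a * b for a, b in zip(p, direction)))
    size = math.ceil(len(ordered) / k)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def local_clusters(part, eps):
    """Stand-in local clustering: connected components of the eps-neighbourhood graph."""
    unvisited = set(range(len(part)))
    clusters = []
    while unvisited:
        stack = [unvisited.pop()]
        members = []
        while stack:
            i = stack.pop()
            members.append(part[i])
            near = [j for j in unvisited if math.dist(part[i], part[j]) <= eps]
            unvisited.difference_update(near)
            stack.extend(near)
        clusters.append(members)
    return clusters


def fit_gaussian(cluster):
    """Summarise a local cluster as (per-axis mean, per-axis std, size)."""
    dims = list(zip(*cluster))
    return [fmean(d) for d in dims], [pstdev(d) for d in dims], len(cluster)


def connected(m1, m2, width=3.0):
    """Assumed connection rule (not the paper's): means within `width` summed stds on every axis."""
    (mu1, s1, _), (mu2, s2, _) = m1, m2
    return all(abs(a - b) <= width * (sa + sb)
               for a, b, sa, sb in zip(mu1, mu2, s1, s2))


def merge_models(models):
    """Union-find over local models; returns one global cluster label per model."""
    parent = list(range(len(models)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(models)):
        for j in range(i + 1, len(models)):
            if connected(models[i], models[j]):
                parent[find(i)] = find(j)
    roots = [find(i) for i in range(len(models))]
    relabel = {r: n for n, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]


def remold_sketch(points, k, eps):
    """Partition, cluster locally, fit Gaussians, then merge only the small model summaries."""
    models = [fit_gaussian(c)
              for part in balanced_partitions(points, k)
              for c in local_clusters(part, eps)]
    return models, merge_models(models)
```

In this sketch only the `(mean, std, size)` triples reach the merge step, which mirrors the abstract's point that shipping models is cheaper than shipping core objects. In a distributed setting the per-partition work would run on the workers and `merge_models` on the server.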
Pages: 376-383
Page count: 8