A hierarchical Gamma Mixture Model-based method for estimating the number of clusters in complex data

Cited by: 5
Authors
Azhar, Muhammad [1]
Huang, Joshua Zhexue [1, 2]
Masud, Md Abdul [1]
Li, Mark Junjie [1, 2]
Cui, Laizhong [1, 2]
Affiliations
[1] Shenzhen Univ, Coll Comp Sci & Software Engn, Shenzhen, Peoples R China
[2] Shenzhen Univ, Natl Engn Lab Big Data Syst Comp Technol, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Number of clusters; Initial cluster centers; Gamma Mixture Model (GMM); EM algorithm; Clustering algorithms; ALGORITHM; CENTERS; FIND;
DOI
10.1016/j.asoc.2019.105891
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This paper proposes a new method for estimating the true number of clusters and the initial cluster centers in a dataset with many clusters. Observation points are assigned to the data space so that the clusters can be observed through the distributions of the distances between the observation points and the objects in the dataset. A Gamma Mixture Model (GMM) is built from a distance distribution to partition the dataset into subsets, and a GMM tree is obtained by recursively partitioning the dataset. From the leaves of the GMM tree, a set of initial cluster centers is identified and the true number of clusters is estimated. This method is implemented in the new GMM-Tree algorithm. Two GMM forest algorithms are further proposed that ensemble multiple GMM trees to handle high-dimensional data with many clusters. The GMM-P-Forest algorithm builds GMM trees in parallel, whereas the GMM-S-Forest algorithm builds a GMM forest sequentially. Experiments were conducted on 32 synthetic datasets and 15 real datasets to evaluate the performance of the new algorithms. The results show that the proposed algorithms outperformed the popular existing methods (Silhouette, Elbow, and Gap Statistic) and the recent I-nice method in estimating the true number of clusters from high-dimensional complex data. (C) 2020 Elsevier B.V. All rights reserved.
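The core step in the abstract, fitting a gamma mixture to the distances between one observation point and the objects in the dataset, can be illustrated with a minimal Python sketch. This is not the paper's GMM-Tree algorithm: the function names, the synthetic data, the hand-supplied number of components `k`, and the moment-matching M-step (used in place of an exact maximum-likelihood update) are all assumptions made for illustration.

```python
import math
import numpy as np


def gamma_logpdf(x, shape, scale):
    """Log-density of Gamma(shape, scale) evaluated at positive x."""
    return ((shape - 1.0) * np.log(x) - x / scale
            - shape * math.log(scale) - math.lgamma(shape))


def fit_gamma_mixture(d, k, n_iter=200):
    """EM-style fit of a k-component gamma mixture to positive distances d.

    Assumption: the M-step uses weighted moment matching
    (shape = mean^2 / var, scale = var / mean) instead of the exact
    maximum-likelihood update, which has no closed form for the shape.
    """
    d = np.asarray(d, dtype=float)
    # Initialise each component from one chunk of the sorted distances.
    chunks = np.array_split(np.sort(d), k)
    mean = np.array([c.mean() for c in chunks])
    var = np.array([c.var() + 1e-6 for c in chunks])
    weight = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        shape, scale = mean ** 2 / var, var / mean
        # E-step: responsibility of each component for each distance.
        log_r = np.stack([
            math.log(weight[j]) + gamma_logpdf(d, shape[j], scale[j])
            for j in range(k)
        ], axis=1)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted mean and variance per component.
        nk = r.sum(axis=0) + 1e-12
        weight = nk / nk.sum()
        mean = (r * d[:, None]).sum(axis=0) / nk
        var = (r * (d[:, None] - mean) ** 2).sum(axis=0) / nk + 1e-6
    return weight, mean, r.argmax(axis=1)


# Hypothetical data: three well-separated 2-D clusters.
rng = np.random.default_rng(0)
centers = np.array([[5.0, 0.0], [15.0, 0.0], [30.0, 0.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((200, 2)) for c in centers])

# One observation point; the clusters are viewed through the distances to it.
observation_point = np.zeros(2)
distances = np.linalg.norm(X - observation_point, axis=1)

weight, mean, labels = fit_gamma_mixture(distances, k=3)
```

Here `labels` partitions the objects by the mixture component that best explains their distance to the observation point. A tree in the spirit of the abstract would repeat this kind of partitioning recursively on each subset, possibly from different observation points; that part is not sketched here.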
Pages: 20
Related papers
50 in total
  • [21] Estimating the number of clusters in microarray data sets based on an information theoretic criterion
    Nicorici, Daniel
    Astola, Jaakko
    Yli-Harja, Olli
    [J]. 2005 IEEE/SP 13TH WORKSHOP ON STATISTICAL SIGNAL PROCESSING (SSP), VOLS 1 AND 2, 2005, : 936 - 940
  • [22] Mixture Model-Based Classification
    Zhao, Yunpeng
    [J]. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2017, 112 (518) : 881 - 882
  • [23] Hierarchical model-based Internet controller interface design method
    Guo, R
    [J]. ISTM/2005: 6th International Symposium on Test and Measurement, Vols 1-9, Conference Proceedings, 2005, : 7484 - 7487
  • [24] Model-based hierarchical diagnosis method for distribution network faults
    Wang Q.
    Jin T.
    Mei L.
    Liu J.
    [J]. 1600, Electric Power Automation Equipment Press (40): : 73 - 79
  • [25] HIERARCHICAL MODEL-BASED DIAGNOSIS
    MOZETIC, I
    [J]. INTERNATIONAL JOURNAL OF MAN-MACHINE STUDIES, 1991, 35 (03): : 329 - 362
  • [26] A Model-Based Method for Estimating the Attitude of Underground Articulated Vehicles
    Gao, Lulu
    Ma, Fei
    Jin, Chun
    [J]. SENSORS, 2019, 19 (23)
  • [27] Finite mixture-of-gamma distributions: estimation, inference, and model-based clustering
    Young, Derek S.
    Chen, Xi
    Hewage, Dilrukshi C.
    Nilo-Poyanco, Ricardo
    [J]. ADVANCES IN DATA ANALYSIS AND CLASSIFICATION, 2019, 13 (04) : 1053 - 1082
  • [28] Finite mixture-of-gamma distributions: estimation, inference, and model-based clustering
    Derek S. Young
    Xi Chen
    Dilrukshi C. Hewage
    Ricardo Nilo-Poyanco
    [J]. Advances in Data Analysis and Classification, 2019, 13 : 1053 - 1082
  • [29] AN ALGORITHM FOR ESTIMATING NUMBER OF COMPONENTS OF GAUSSIAN MIXTURE MODEL BASED ON PENALIZED DISTANCE
    Zhang, Daming
    Guo, Hui
    Luo, Bin
    [J]. 2008 INTERNATIONAL CONFERENCE ON NEURAL NETWORKS AND SIGNAL PROCESSING, VOLS 1 AND 2, 2007, : 482 - +
  • [30] A hierarchical clustering method for estimating copy number variation
    Xing, Baifang
    Greenwood, Celia M. T.
    Bull, Shelley B.
    [J]. BIOSTATISTICS, 2007, 8 (03) : 632 - 653