An ensemble method for estimating the number of clusters in a big data set using multiple random samples

Cited by: 5
Authors
Mahmud, Mohammad Sultan [1, 2]
Huang, Joshua Zhexue [1, 2]
Ruby, Rukhsana [3]
Wu, Kaishun [1, 2]
Affiliations
[1] Shenzhen Univ, Big Data Inst, Coll Comp Sci & Software Engn, Shenzhen 518060, Peoples R China
[2] Shenzhen Univ, Natl Engn Lab Big Data Syst Comp Technol, Shenzhen 518060, Peoples R China
[3] Guangdong Lab Artificial Intelligence & Digital Ec, Shenzhen 518107, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Ensemble learning; Number of clusters; Random sample partition; Cluster ball model; Approximate computing; I-NICE; ALGORITHM; CONSENSUS;
DOI
10.1186/s40537-023-00709-4
Chinese Library Classification (CLC) number
TP301 [Theory and methods]
Discipline classification code
081202
Abstract
Clustering a big dataset without knowing the number of clusters is a major challenge for many existing clustering algorithms. In this paper, we propose a Random Sample Partition-based Centers Ensemble (RSPCE) algorithm to identify the number of clusters in a big dataset. In this algorithm, a set of disjoint random samples is selected from the big dataset, and the I-niceDP algorithm is used to identify the number of clusters and the initial centers in each sample. Subsequently, a cluster ball model is proposed to merge two clusters from the random samples that were likely sampled from the same cluster in the big dataset. Finally, based on the ball model, the RSPCE ensemble method combines the results of all samples into the final result, a set of initial cluster centers for the big dataset. Extensive experiments were conducted on both synthetic and real datasets to validate the feasibility and effectiveness of the proposed RSPCE algorithm. The experimental results show that the ensemble result from multiple random samples is a reliable approximation of the actual number of clusters, and that the RSPCE algorithm is scalable to big data.
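The pipeline described in the abstract (disjoint random samples, per-sample centers, ball-based merging, ensemble) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The per-sample step uses a simple greedy leader clustering as a stand-in for I-niceDP, and the merge rule (two balls are joined when the distance between their centers does not exceed the larger radius) is an assumed criterion; the paper's ball model may differ. All function names, the `radius` parameter, and the toy data are hypothetical.

```python
import math
import random


def random_sample_partition(points, n_blocks, seed=0):
    """Shuffle the data, then deal it into n_blocks disjoint random samples."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_blocks] for i in range(n_blocks)]


def leader_balls(sample, radius):
    """Stand-in for the per-sample step (NOT I-niceDP): greedy leader clustering.

    Returns one (center, ball_radius) pair per cluster found in the sample.
    """
    groups = []  # groups[i][0] is the leader point of group i
    for p in sample:
        for g in groups:
            if math.dist(p, g[0]) <= radius:
                g.append(p)
                break
        else:
            groups.append([p])  # no leader close enough: start a new group
    balls = []
    for g in groups:
        center = tuple(sum(c) / len(g) for c in zip(*g))
        ball_radius = max(math.dist(center, q) for q in g)
        balls.append((center, ball_radius))
    return balls


def merge_balls(balls):
    """Assumed merge rule: join two balls when their centers are no farther
    apart than the larger of the two radii; return one averaged center per group."""
    parent = list(range(len(balls)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(balls)):
        for j in range(i + 1, len(balls)):
            (ci, ri), (cj, rj) = balls[i], balls[j]
            if math.dist(ci, cj) <= max(ri, rj):
                parent[find(i)] = find(j)

    merged = {}
    for i, (center, _) in enumerate(balls):
        merged.setdefault(find(i), []).append(center)
    return [tuple(sum(x) / len(cs) for x in zip(*cs)) for cs in merged.values()]


def estimate_centers(points, n_blocks=5, radius=3.0, seed=0):
    """Ensemble the per-sample balls into one set of initial cluster centers."""
    balls = []
    for sample in random_sample_partition(points, n_blocks, seed):
        balls.extend(leader_balls(sample, radius))
    return merge_balls(balls)


def make_toy_data(seed=1, per_cluster=200):
    """Hypothetical toy data: three well-separated square blobs in 2-D."""
    rng = random.Random(seed)
    true_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    return [(cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
            for cx, cy in true_centers for _ in range(per_cluster)]


centers = estimate_centers(make_toy_data())
print(len(centers), "estimated clusters:", centers)
```

The number of merged groups serves as the estimate of the number of clusters, and the averaged centers could be passed to a clustering algorithm such as k-means as initial centers. Because each sample is processed independently, the per-sample step could be run in parallel, which is one plausible reading of the scalability claim.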
Pages: 33
Related papers
50 in total
  • [1] An ensemble method for estimating the number of clusters in a big data set using multiple random samples
    Mohammad Sultan Mahmud
    Joshua Zhexue Huang
    Rukhsana Ruby
    Kaishun Wu
    [J]. Journal of Big Data, 10
  • [2] A hybrid method for estimating the predominant number of clusters in a data set
    Al Shaqsi, Jamil
    Wang, Wenjia
    [J]. 2012 11TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA 2012), VOL 2, 2012 : 569 - 573
  • [3] Recovery From Random Samples in a Big Data Set
    Molavipour, Sina
    Gohari, Amin
    [J]. IEEE COMMUNICATIONS LETTERS, 2015, 19 (11) : 1929 - 1932
  • [4] Estimating the number of clusters in a data set via the gap statistic
    Tibshirani, R
    Walther, G
    Hastie, T
    [J]. JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY, 2001, 63 : 411 - 423
  • [5] A Multicriteria Decision Making Approach for Estimating the Number of Clusters in a Data Set
    Peng, Yi
    Zhang, Yong
    Kou, Gang
    Shi, Yong
    [J]. PLOS ONE, 2012, 7 (07)
  • [6] Estimating the number of clusters from distributional results of partitioning a given data set
    Möller, U
    [J]. Adaptive and Natural Computing Algorithms, 2005 : 151 - 154
  • [7] Estimating the number of clusters in a numerical data set via quantization error modeling
    Kolesnikov, Alexander
    Trichina, Elena
    Kauranne, Tuomo
    [J]. PATTERN RECOGNITION, 2015, 48 (03) : 941 - 952
  • [8] Frequent Item set Using Abundant Data on Hadoop Clusters in Big Data
    Danapaquiame, N.
    Balaji, V.
    Gayathri, R.
    Kodhai, E.
    Sambasivam, G.
    [J]. BIOSCIENCE BIOTECHNOLOGY RESEARCH COMMUNICATIONS, 2018, 11 (01) : 104 - 112
  • [9] Evaluation of the number of clusters in a data set using p-values from multiple tests of hypotheses
    Modak, Soumita
    [J]. COMMUNICATIONS IN STATISTICS-THEORY AND METHODS, 2024
  • [10] Estimating the Optimal Number of Clusters k in a Dataset Using Data Depth
    Patil, Channamma
    Baidari, Ishwar
    [J]. DATA SCIENCE AND ENGINEERING, 2019, 4 (02) : 132 - 140