An ensemble method for estimating the number of clusters in a big data set using multiple random samples

Cited by: 5
Authors
Mahmud, Mohammad Sultan [1, 2]
Huang, Joshua Zhexue [1, 2]
Ruby, Rukhsana [3]
Wu, Kaishun [1, 2]
Affiliations
[1] Shenzhen Univ, Big Data Inst, Coll Comp Sci & Software Engn, Shenzhen 518060, Peoples R China
[2] Shenzhen Univ, Natl Engn Lab Big Data Syst Comp Technol, Shenzhen 518060, Peoples R China
[3] Guangdong Lab Artificial Intelligence & Digital Ec, Shenzhen 518107, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Ensemble learning; Number of clusters; Random sample partition; Cluster ball model; Approximate computing; I-NICE; ALGORITHM; CONSENSUS;
DOI
10.1186/s40537-023-00709-4
Chinese Library Classification number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
Clustering a big dataset without knowing the number of clusters is a major challenge for many existing clustering algorithms. In this paper, we propose a Random Sample Partition-based Centers Ensemble (RSPCE) algorithm to identify the number of clusters in a big dataset. In this algorithm, a set of disjoint random samples is selected from the big dataset, and the I-niceDP algorithm is used to identify the number of clusters and the initial centers in each sample. Subsequently, a cluster ball model is proposed to merge clusters from different random samples that are likely sampled from the same cluster in the big dataset. Finally, based on the ball model, the RSPCE ensemble method combines the results of all samples into the final result: a set of initial cluster centers for the big dataset. Extensive experiments were conducted on both synthetic and real datasets to validate the feasibility and effectiveness of the proposed RSPCE algorithm. The experimental results show that the ensemble result from multiple random samples is a reliable approximation of the actual number of clusters, and that the RSPCE algorithm is scalable to big data.
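The pipeline described in the abstract (disjoint random samples, per-sample center estimation, ball-based merging, ensemble) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the `radius` and `merge_radius` parameters, and the merge rule (link two centers when their distance is at most `merge_radius`) are all assumptions made for illustration, and a simple leader-clustering routine stands in for I-niceDP, whose details are not given in this record.

```python
import numpy as np

def random_sample_partition(X, n_samples, rng):
    """Shuffle row indices and split them into disjoint random samples."""
    idx = rng.permutation(len(X))
    return [X[part] for part in np.array_split(idx, n_samples)]

def estimate_centers(sample, radius):
    """Stand-in for I-niceDP (NOT the paper's algorithm): leader clustering.

    A point joins the first leader within `radius`; otherwise it becomes
    a new leader. Returns the mean of each group as its center.
    """
    leaders, members = [], []
    for x in sample:
        for j, leader in enumerate(leaders):
            if np.linalg.norm(x - leader) <= radius:
                members[j].append(x)
                break
        else:
            leaders.append(x)
            members.append([x])
    return np.array([np.mean(m, axis=0) for m in members])

def ensemble_centers(center_sets, merge_radius):
    """Assumed ball-style merge: centers from any samples are linked when
    they lie within `merge_radius`; each linked group is averaged."""
    centers = np.vstack(center_sets)
    parent = list(range(len(centers)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if np.linalg.norm(centers[i] - centers[j]) <= merge_radius:
                parent[find(i)] = find(j)

    roots = np.array([find(i) for i in range(len(centers))])
    return np.array([centers[roots == r].mean(axis=0) for r in np.unique(roots)])

def estimate_k(X, n_samples=5, radius=4.0, merge_radius=1.0, seed=0):
    """Run the sketch end to end; returns (estimated k, merged centers)."""
    rng = np.random.default_rng(seed)
    samples = random_sample_partition(X, n_samples, rng)
    center_sets = [estimate_centers(s, radius) for s in samples]
    final = ensemble_centers(center_sets, merge_radius)
    return len(final), final
```

Calling `estimate_k(X)` on an array of shape `(n_points, n_features)` returns the estimated number of clusters and the merged centers, which could then seed a clustering algorithm such as k-means on the full dataset. Both radius values are data dependent and would need tuning; the published method derives its merge criterion from the cluster ball model rather than from fixed thresholds.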
Pages: 33