Effects of Resampling in Determining the Number of Clusters in a Data Set

Authors
Rainer Dangl
Friedrich Leisch
Affiliation
[1] University of Natural Resources and Life Sciences,Institute for Applied Statistics and Computing
Source
Journal of Classification | 2020 / Vol. 37
Keywords
Resampling; Model validation; Cluster stability; Clustering; Benchmarking;
Abstract
Cluster validation indices are widely used to detect the number of groups in a data set, which makes them a crucial step in the model validation process in clustering. The study presented in this paper demonstrates how the accuracy of certain indices can be significantly improved when they are calculated numerous times on data sets resampled from the original data. There are many ways to resample data; in this study, three very common options are used: bootstrapping, data splitting (without subset overlap of two subsamples), and random subsetting (with subset overlap of two subsamples). Index values calculated on the basis of resampled data sets are compared to the values obtained from the original data partition. The primary hypothesis of the study states that resampling generally improves index accuracy. The hypothesis is based on the notion of cluster stability: if there are stable clusters in a data set, a clustering algorithm should produce consistent results for data sampled or resampled from the same source. The primary hypothesis was partly confirmed: it holds for external validation measures. The secondary hypothesis states that the resampling strategy itself does not play a significant role. This was also shown to be accurate, although slight deviations between the resampling schemes suggest that splitting yields slightly better results.
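The three resampling schemes named in the abstract can be illustrated with a minimal Python sketch. It only shows how pairs of index sets might be drawn under each scheme, plus a plain Rand index as one example of an external measure for comparing two partitions. The function names, the 80% subset fraction, and the choice of the Rand index are assumptions made for illustration; the abstract does not specify the authors' implementation, and the clustering step itself is omitted.

```python
import random
from itertools import combinations


def bootstrap_pair(n, rng):
    """Bootstrapping: two samples of n indices each, drawn with replacement."""
    a = [rng.randrange(n) for _ in range(n)]
    b = [rng.randrange(n) for _ in range(n)]
    return a, b


def split_pair(n, rng):
    """Data splitting: two disjoint halves of one random permutation."""
    perm = list(range(n))
    rng.shuffle(perm)
    half = n // 2
    return perm[:half], perm[half:2 * half]


def subset_pair(n, rng, frac=0.8):
    """Random subsetting: two independent subsets, each drawn without
    replacement, so the two subsets may overlap (frac is an assumed value)."""
    k = int(frac * n)
    return rng.sample(range(n), k), rng.sample(range(n), k)


def rand_index(labels_a, labels_b):
    """Share of point pairs on which two partitions of the same points agree
    (both put the pair in one cluster, or both separate it)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


rng = random.Random(0)  # fixed seed so the draws are reproducible
n = 10
boot_a, boot_b = bootstrap_pair(n, rng)
split_a, split_b = split_pair(n, rng)
sub_a, sub_b = subset_pair(n, rng)
```

In a stability analysis along the lines the abstract describes, each index set would be clustered separately, the two resulting partitions would be compared on a common set of points, and the comparison would be repeated over many resampled pairs for each candidate number of clusters.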
Pages: 558-583
Number of pages: 25