Determine the number of clusters by data augmentation

Cited by: 0
Author
Luo, Wei [1]
Affiliation
[1] Zhejiang Univ, Ctr Data Sci, 866 Yuhangtang Rd, Hangzhou, Peoples R China
Source
ELECTRONIC JOURNAL OF STATISTICS | 2022 / Volume 16 / Issue 02
Funding
National Science Foundation (USA);
Keywords
Data augmentation; instability of clustering; model-based clustering; order determination; VARIABLE SELECTION; MODEL; LIKELIHOOD;
DOI
10.1214/22-EJS2032
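Chinese Library Classification (CLC) number
O21 [Probability theory and mathematical statistics]; C8 [Statistics];
Discipline classification code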
020208 ; 070103 ; 0714 ;
Abstract
Determining the number of clusters is crucial for the successful application of clustering. In this paper, we propose a new order-determination method called the data augmentation estimator (DAE), for the general model-based clustering. The estimator is based on a novel idea that augments data with an independently generated small cluster, which enables us to justify how the instability of clustering changes with the number of clusters assumed in clustering. The pattern of instability provides an alternative characterization of the true number of clusters to the commonly used goodness-of-fit measure. By combining the two sources of information appropriately, the proposed estimator reaches asymptotic consistency under general conditions and is easily implementable. It is also more efficient than the conventional BIC-type approaches that use the goodness-of-fit measure only. These properties are illustrated by the simulation studies and real data examples at the end.
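The augmentation idea in the abstract can be illustrated with a rough sketch. This is not the paper's DAE: it swaps in plain k-means for model-based clustering, uses a simple pairwise-disagreement score as a stand-in for instability, and leaves out the goodness-of-fit component entirely. All function names, the toy data, and the parameters `n_rep`, `small_n`, and `seed` are assumptions made for illustration only.

```python
import random
from itertools import combinations


def kmeans_labels(points, k, rng, n_iter=30):
    """Lloyd's k-means on 2-D tuples from one random start; one label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [
            min(range(k), key=lambda j: (p[0] - centers[j][0]) ** 2
                + (p[1] - centers[j][1]) ** 2)
            for p in points
        ]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels


def disagreement(a, b):
    """Fraction of point pairs grouped together in one labeling but apart in the other."""
    pairs = list(combinations(range(len(a)), 2))
    differ = sum((a[i] == a[j]) != (b[i] == b[j]) for i, j in pairs)
    return differ / len(pairs)


def augmented_instability(points, k, n_rep=6, small_n=5, seed=0):
    """Average disagreement, on the original points, across augmented replicates."""
    rng = random.Random(seed)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    runs = []
    for _ in range(n_rep):
        # Assumed augmentation: a tight extra cluster at a random spot in the bounding box.
        cx, cy = rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys))
        extra = [(rng.gauss(cx, 0.1), rng.gauss(cy, 0.1)) for _ in range(small_n)]
        labels = kmeans_labels(points + extra, k, rng)
        runs.append(labels[: len(points)])  # score the original points only
    scores = [disagreement(a, b) for a, b in combinations(runs, 2)]
    return sum(scores) / len(scores)


def make_toy_data(seed=1):
    """Three well-separated Gaussian blobs of 20 points each (illustrative data)."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]
    return [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
            for cx, cy in centers for _ in range(20)]


toy = make_toy_data()
instability_by_k = {k: augmented_instability(toy, k) for k in range(2, 6)}
for k, score in instability_by_k.items():
    print(k, round(score, 3))
```

The score is computed on the original points only, so it reflects how much the added small cluster perturbs their grouping at each assumed number of clusters. How the paper combines such a pattern with goodness of fit is not reproduced here.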
Pages: 3910-3936
Number of pages: 27
Related papers
50 in total
  • [31] An automatic method to determine the number of clusters using decision-theoretic rough set
    Yu, Hong
    Liu, Zhanguo
    Wang, Guoyin
    [J]. Yu, H. (yuhongcq@aliyun.com), 1600, Elsevier Inc. (55): : 101 - 115
  • [32] Performing Multi-Objective Optimization Alongside Dimension Reduction to Determine Number of Clusters
    Mollaian, Melisa
    Dorgo, Gyula
    Palazoglu, Ahmet
    [J]. PROCESSES, 2022, 10 (05)
  • [33] An automatic method to determine the number of clusters using decision-theoretic rough set
    Yu, Hong
    Liu, Zhanguo
    Wang, Guoyin
    [J]. INTERNATIONAL JOURNAL OF APPROXIMATE REASONING, 2014, 55 (01) : 101 - 115
  • [34] Estimating the number of clusters in a data set via the gap statistic
    Tibshirani, R
    Walther, G
    Hastie, T
    [J]. JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY, 2001, 63 : 411 - 423
  • [35] Automatic Determination of the Appropriate Number of Clusters for Multispectral Image Data
    Koonsanit, Kitti
    Jaruskulchai, Chuleerat
    [J]. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2012, E95D (05): : 1256 - 1263
  • [36] A new method for GAN-based data augmentation for classes with distinct clusters
    Kuntalp, Mehmet
    Duzyel, Okan
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2024, 235
  • [37] An examination of indexes for determining the number of clusters in binary data sets
    Evgenia Dimitriadou
    Sara Dolničar
    Andreas Weingessel
    [J]. Psychometrika, 2002, 67 : 137 - 159
  • [38] Determining the number of clusters using information entropy for mixed data
    Liang, Jiye
    Zhao, Xingwang
    Li, Deyu
    Cao, Fuyuan
    Dang, Chuangyin
    [J]. PATTERN RECOGNITION, 2012, 45 (06) : 2251 - 2265
  • [39] A Criterion for Deciding the Number of Clusters in a Dataset Based on Data Depth
    Baidari, Ishwar
    Patil, Channamma
    [J]. VIETNAM JOURNAL OF COMPUTER SCIENCE, 2020, 7 (04) : 417 - 431
  • [40] A hybrid method for estimating the predominant number of clusters in a data set
    Al Shaqsi, Jamil
    Wang, Wenjia
    [J]. 2012 11TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA 2012), VOL 2, 2012, : 569 - 573