Determine the number of clusters by data augmentation

Cited by: 0
Author
Luo, Wei [1]
Affiliation
[1] Zhejiang Univ, Ctr Data Sci, 866 Yuhangtang Rd, Hangzhou, Peoples R China
Source
Electronic Journal of Statistics | 2022, Vol. 16, No. 2
Funding
National Science Foundation (USA)
Keywords
Data augmentation; instability of clustering; model-based clustering; order determination; variable selection; model; likelihood
DOI
10.1214/22-EJS2032
Chinese Library Classification
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract
Determining the number of clusters is crucial for the successful application of clustering. In this paper, we propose a new order-determination method, called the data augmentation estimator (DAE), for general model-based clustering. The estimator is based on a novel idea: it augments the data with an independently generated small cluster, which enables us to justify how the instability of clustering changes with the number of clusters assumed in clustering. The pattern of instability provides a characterization of the true number of clusters that is an alternative to the commonly used goodness-of-fit measure. By combining the two sources of information appropriately, the proposed estimator reaches asymptotic consistency under general conditions and is easy to implement. It is also more efficient than the conventional BIC-type approaches, which use the goodness-of-fit measure only. These properties are illustrated by simulation studies and real data examples at the end of the paper.
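For orientation, the conventional BIC-type baseline that the abstract compares against can be sketched as follows. This is not the DAE and does not reproduce the paper's method. It is a minimal illustration under assumptions that are ours, not the source's: one-dimensional data, a Gaussian mixture fitted by a plain EM loop, and hypothetical helper names (`fit_gmm_1d`, `bic`, `select_k_by_bic`).

```python
import math
import random


def fit_gmm_1d(x, k, n_iter=200):
    """Fit a k-component 1-D Gaussian mixture by EM and return the
    log-likelihood evaluated in the last E-step (illustrative only)."""
    n = len(x)
    xs = sorted(x)
    # Deterministic start: means at evenly spaced quantiles, pooled variance.
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    mean_all = sum(x) / n
    var_all = sum((v - mean_all) ** 2 for v in x) / n
    variances = [var_all] * k
    weights = [1.0 / k] * k
    loglik = -math.inf
    for _ in range(n_iter):
        # E-step: responsibilities and log-likelihood under current parameters.
        resp = []
        loglik = 0.0
        for v in x:
            dens = [
                w * math.exp(-0.5 * (v - m) ** 2 / s) / math.sqrt(2 * math.pi * s)
                for w, m, s in zip(weights, means, variances)
            ]
            total = sum(dens) + 1e-300  # guard against log(0)
            loglik += math.log(total)
            resp.append([d / total for d in dens])
        # M-step: update weights, means, and variances (variance floored).
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * v for r, v in zip(resp, x)) / nj
            var_j = sum(r[j] * (v - means[j]) ** 2 for r, v in zip(resp, x)) / nj
            variances[j] = max(var_j, 1e-3)
    return loglik


def bic(loglik, k, n):
    """BIC = -2 * log-likelihood + (number of free parameters) * log(n)."""
    n_params = 3 * k - 1  # k means, k variances, k - 1 free weights
    return -2.0 * loglik + n_params * math.log(n)


def select_k_by_bic(x, max_k):
    """Return (best_k, {k: BIC}) using goodness of fit only."""
    values = {k: bic(fit_gmm_1d(x, k), k, len(x)) for k in range(1, max_k + 1)}
    return min(values, key=values.get), values


# Synthetic example (assumed data): two well-separated Gaussian clusters.
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(150)]
data += [random.gauss(8.0, 1.0) for _ in range(150)]
best_k, bic_values = select_k_by_bic(data, max_k=4)
print(best_k, bic_values)
```

The selection here relies on the penalized likelihood alone; according to the abstract, the DAE additionally uses the pattern of clustering instability obtained after augmenting the data with a small, independently generated cluster.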
Pages: 3910-3936
Number of pages: 27