Stability and model selection in k-means clustering

Authors
Shamir, Ohad [1]
Tishby, Naftali [1, 2]
Affiliations
[1] Hebrew Univ Jerusalem, Sch Comp Sci & Engn, IL-91904 Jerusalem, Israel
[2] Hebrew Univ Jerusalem, Interdisciplinary Ctr Neural Computat, IL-91904 Jerusalem, Israel
Keywords
Clustering; Model selection; Stability; Statistical learning theory; Validation
DOI
10.1007/s10994-010-5177-8
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Clustering stability methods are a family of widely used model selection techniques for data clustering. Their unifying theme is that an appropriate model should result in a clustering which is robust with respect to various kinds of perturbations. Despite their relative success, not much is known theoretically about why or when they work, or even what kind of assumptions they make in choosing an 'appropriate' model. Moreover, recent theoretical work has shown that they might 'break down' for large enough samples. In this paper, we focus on the behavior of clustering stability using k-means clustering. Our main technical result is an exact characterization of the distribution to which suitably scaled measures of instability converge, based on a sample drawn from any distribution in ℝ^n satisfying mild regularity conditions. From this, we can show that clustering stability does not 'break down' even for arbitrarily large samples, at least for the k-means framework. Moreover, it allows us to identify the factors which eventually determine the behavior of clustering stability. This leads to some basic observations about what kind of assumptions are made when using these methods. While often reasonable, these assumptions might also lead to unexpected consequences.
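To make the idea of clustering stability concrete, the following is a minimal sketch of one common way to estimate instability for a given k. It is not the estimator analyzed in the paper: the half-size subsampling scheme, the plain Lloyd's iteration, the label-disagreement distance, and all function names (`kmeans`, `assign`, `disagreement`, `instability`) are assumptions chosen for illustration.

```python
import itertools
import numpy as np

def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's algorithm with random initialisation; returns k centroids."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Recompute centroids; keep the old centroid if a cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers

def assign(X, centers):
    """Label every point in X with the index of its nearest centroid."""
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)

def disagreement(a, b, k):
    """Fraction of points labelled differently, minimised over label permutations."""
    best = 1.0
    for perm in itertools.permutations(range(k)):
        relabelled = np.array(perm)[b]
        best = min(best, float(np.mean(a != relabelled)))
    return best

def instability(X, k, n_pairs=10, seed=0):
    """Average disagreement between clusterings fitted on pairs of subsamples."""
    rng = np.random.default_rng(seed)
    n = len(X)
    scores = []
    for _ in range(n_pairs):
        # Fit k-means on two independently drawn half-size subsamples.
        s1 = X[rng.choice(n, size=n // 2, replace=False)]
        s2 = X[rng.choice(n, size=n // 2, replace=False)]
        c1 = kmeans(s1, k, rng)
        c2 = kmeans(s2, k, rng)
        # Compare the two induced clusterings on the full data set.
        scores.append(disagreement(assign(X, c1), assign(X, c2), k))
    return float(np.mean(scores))
```

In a stability-based model selection procedure of this kind, one would compute `instability(X, k)` over a range of k and prefer values of k with low instability. The permutation search in `disagreement` is exponential in k, so the sketch is only practical for small k.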
Pages: 213-243
Page count: 31
Related papers
50 in total
  • [1] Stability and model selection in k-means clustering
    Ohad Shamir
    Naftali Tishby
    [J]. Machine Learning, 2010, 80 : 213 - 243
  • [2] Selection of K in K-means clustering
    Pham, DT
    Dimov, SS
    Nguyen, CD
    [J]. PROCEEDINGS OF THE INSTITUTION OF MECHANICAL ENGINEERS PART C-JOURNAL OF MECHANICAL ENGINEERING SCIENCE, 2005, 219 (01) : 103 - 119
  • [3] Stability of k-means clustering
    Ben-David, Shai
    Pal, David
    Simon, Hans Ulrich
    [J]. LEARNING THEORY, PROCEEDINGS, 2007, 4539 : 20 - +
  • [4] Degrees of freedom and model selection for k-means clustering
    Hofmeyr, David P.
    [J]. COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2020, 149 (149)
  • [5] A notion of stability for k-means clustering
    Le Gouic, T.
    Paris, Q.
    [J]. ELECTRONIC JOURNAL OF STATISTICS, 2018, 12 (02) : 4239 - 4263
  • [6] Stability analysis in K-means clustering
    Steinley, Douglas
    [J]. BRITISH JOURNAL OF MATHEMATICAL & STATISTICAL PSYCHOLOGY, 2008, 61 : 255 - 273
  • [7] A Novel Stability Based Feature Selection Framework for k-means Clustering
    Mavroeidis, Dimitrios
    Marchiori, Elena
    [J]. MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, PT II, 2011, 6912 : 421 - 436
  • [8] Feature selection for k-means clustering stability: theoretical analysis and an algorithm
    Mavroeidis, Dimitrios
    Marchiori, Elena
    [J]. DATA MINING AND KNOWLEDGE DISCOVERY, 2014, 28 (04) : 918 - 960
  • [9] Feature selection for k-means clustering stability: theoretical analysis and an algorithm
    Dimitrios Mavroeidis
    Elena Marchiori
    [J]. Data Mining and Knowledge Discovery, 2014, 28 : 918 - 960
  • [10] Deterministic Feature Selection for k-Means Clustering
    Boutsidis, Christos
    Magdon-Ismail, Malik
    [J]. IEEE TRANSACTIONS ON INFORMATION THEORY, 2013, 59 (09) : 6099 - 6110