Finding the number of clusters in a dataset: An information-theoretic approach

Cited by: 533
Authors
Sugar, CA [1]
James, GM [1]
Affiliations
[1] Univ So Calif, Marshall Sch Business, Los Angeles, CA 90089 USA
Keywords
cluster analysis; distortion; information theory; k-means clustering; mixture models
DOI
10.1198/016214503000000666
Chinese Library Classification (CLC)
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Subject classification codes
020208; 070103; 0714
Abstract
One of the most difficult problems in cluster analysis is identifying the number of groups in a dataset. Most previously suggested approaches to this problem are either somewhat ad hoc or require parametric assumptions and complicated calculations. In this article we develop a simple yet powerful nonparametric method for choosing the number of clusters based on distortion, a quantity that measures the average distance, per dimension, between each observation and its closest cluster center. Our technique is computationally efficient and straightforward to implement. We demonstrate empirically its effectiveness, not only for choosing the number of clusters but also for identifying underlying structure, on a wide range of simulated and real-world datasets. In addition, we give a rigorous theoretical justification for the method based on information-theoretic ideas. Specifically, results from the subfield of electrical engineering known as rate distortion theory allow us to describe the behavior of the distortion in both the presence and absence of clustering. Finally, we note that these ideas can potentially be extended to a wide range of other statistical model selection problems.
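The distortion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes squared Euclidean distance (identity covariance), and the selection step assumes a transformation power of p/2 with the transformed distortion for zero clusters set to 0, neither of which is stated in the abstract. The function names `distortion` and `choose_k_by_jump` are hypothetical.

```python
import numpy as np


def distortion(X, centers):
    """Average squared distance, per dimension, from each observation
    to its closest center (assumes squared Euclidean distance)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Pairwise squared distances, shape (n_observations, n_centers).
    sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Closest center per observation, averaged, then divided by dimension p.
    return sq.min(axis=1).mean() / X.shape[1]


def choose_k_by_jump(distortions, p, power=None):
    """Pick K from distortions[i] = distortion with K = i + 1 clusters.

    Assumption (not in the abstract): distortions are raised to the
    power -p/2 and the K with the largest increase ("jump") is chosen.
    All distortions must be strictly positive.
    """
    y = p / 2.0 if power is None else power
    d = np.asarray(distortions, dtype=float)
    # Prepend 0 as the transformed distortion for K = 0.
    transformed = np.concatenate(([0.0], d ** (-y)))
    jumps = np.diff(transformed)
    return int(np.argmax(jumps)) + 1, jumps


# Toy data: two tight groups in two dimensions.
X = [[0, 0], [0, 1], [10, 0], [10, 1]]
print(distortion(X, [[5, 0.5]]))               # one center
print(distortion(X, [[0, 0.5], [10, 0.5]]))    # two centers

# Illustrative distortion curve for K = 1..4 (made-up values).
k, jumps = choose_k_by_jump([12.0, 1.0, 0.9, 0.8], p=2)
print(k, jumps)
```

In practice the centers for each K would come from a clustering routine such as k-means; they are supplied by hand here so the sketch stays deterministic.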
Pages: 750-763
Page count: 14
Related papers
50 in total
  • [21] An Information-Theoretic approach for Bug Triaging
    Yadav, Asmita
    Singh, Sandeep Kumar
    [J]. PROCEEDINGS OF THE 8TH INTERNATIONAL CONFERENCE CONFLUENCE 2018 ON CLOUD COMPUTING, DATA SCIENCE AND ENGINEERING, 2018, : 7 - 13
  • [22] An Information-Theoretic Approach to Analyzing CLEAN
    Bose, Ranjan
    [J]. IEEE TRANSACTIONS ON AEROSPACE AND ELECTRONIC SYSTEMS, 2014, 50 (03) : 1673 - 1679
  • [23] An information-theoretic approach for the quantification of relevance
    Polani, Daniel
    Martinetz, Thomas
    Kim, Jan
    [J]. ADVANCES IN ARTIFICIAL LIFE, 2001, 2159 : 704 - 713
  • [24] A Generalized Information-Theoretic Approach for Bounding the Number of Independent Sets in Bipartite Graphs
    Sason, Igal
    [J]. ENTROPY, 2021, 23 (03) : 1 - 14
  • [25] An Information-Theoretic Approach for Setting the Optimal Number of Decision Trees in Random Forests
    Cuzzocrea, Alfredo
    Francis, Shane Leo
    Gaber, Mohamed Medhat
    [J]. 2013 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS (SMC 2013), 2013, : 1013 - 1019
  • [26] A geometric approach to information-theoretic private information retrieval
    Woodruff, D
    Yekhanin, S
    [J]. TWENTIETH ANNUAL IEEE CONFERENCE ON COMPUTATIONAL COMPLEXITY, PROCEEDINGS, 2005, : 275 - 284
  • [27] An information-theoretic approach to statistical dependence: Copula information
    Calsaverini, R. S.
    Vicente, R.
    [J]. EPL, 2009, 88 (06)
  • [28] A geometric approach to information-theoretic private information retrieval
    Woodruff, David
    Yekhanin, Sergey
    [J]. SIAM JOURNAL ON COMPUTING, 2007, 37 (04) : 1046 - 1056
  • [29] Information-Theoretic Analysis of OFDM With Subcarrier Number Modulation
    Dang, Shuping
    Guo, Shuaishuai
    Shihada, Basem
    Alouini, Mohamed-Slim
    [J]. IEEE TRANSACTIONS ON INFORMATION THEORY, 2021, 67 (11) : 7338 - 7354
  • [30] An Information-theoretic approach for computational material modeling
    Furukawa, Tomonari
    Michopoulos, John G.
    [J]. ADVANCES IN FRACTURE AND MATERIALS BEHAVIOR, PTS 1 AND 2, 2008, 33-37 : 857 - +