Estimating the number of clusters in a numerical data set via quantization error modeling

Cited by: 40
Authors
Kolesnikov, Alexander [1]
Trichina, Elena [2]
Kauranne, Tuomo [3]
Affiliations
[1] Arbonaut Ltd, Joensuu, Finland
[2] Univ Eastern Finland, Joensuu, Finland
[3] Lappeenranta Univ Technol, Lappeenranta, Finland
Keywords
Clustering; Number of clusters; Vector quantization; Color quantization; Dominant colors; Fractal dimensions; ALGORITHM;
DOI
10.1016/j.patcog.2014.09.017
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we consider the problem of unsupervised clustering (vector quantization) of multidimensional numerical data. We propose a new method for determining an optimal number of clusters in the data set. The method is based on parametric modeling of the quantization error. The model parameter can be treated as the effective dimensionality of the data set. The proposed method was tested with artificial and real numerical data sets and the results of the experiments demonstrate empirically not only the effectiveness of the method but its ability to cope with difficult cases where other known methods fail. (C) 2014 Elsevier Ltd. All rights reserved.
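The abstract says the method fits a parametric model to the quantization error and reads the model parameter as an effective dimensionality, but it does not give the model itself. As a rough illustration only, the sketch below assumes a power-law form E(k) ≈ c · k^(-2/d), borrowed from high-resolution quantization theory and not confirmed here as the paper's model. The function name `fit_effective_dimension` and the synthetic error curve are hypothetical.

```python
import numpy as np


def fit_effective_dimension(ks, errors):
    """Fit errors ~ c * k**(-2/d) by least squares in log-log space.

    ASSUMPTION: the power-law form is illustrative; the abstract does not
    state the paper's actual parametric model.
    Returns (d, c): the effective dimensionality and the scale factor.
    """
    log_k = np.log(np.asarray(ks, dtype=float))
    log_e = np.log(np.asarray(errors, dtype=float))
    # log E = log c - (2/d) * log k, so the fitted slope equals -2/d.
    slope, intercept = np.polyfit(log_k, log_e, 1)
    d = -2.0 / slope
    c = float(np.exp(intercept))
    return float(d), c


# Synthetic (made-up) error curve generated with d = 3 and c = 5.
ks = np.array([2, 4, 8, 16, 32])
errors = 5.0 * ks ** (-2.0 / 3.0)
d_hat, c_hat = fit_effective_dimension(ks, errors)
print(f"estimated d = {d_hat:.3f}, c = {c_hat:.3f}")
```

In practice the `errors` values would come from running a quantizer (for example k-means) at each k; how the fitted curve is then turned into an estimate of the number of clusters is specific to the paper and is not reproduced here.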
Pages: 941-952
Number of pages: 12
Related papers
50 in total
  • [31] A new similarity measure and its use in determining the number of clusters in a multivariate data set
    Vassiliou, A
    Tambouratzis, DG
    Koutras, MV
    Bersimis, S
    [J]. COMMUNICATIONS IN STATISTICS-THEORY AND METHODS, 2004, 33 (07) : 1643 - 1666
  • [32] A Method to Find Optimum Number of Clusters Based on Fuzzy Silhouette on Dynamic Data Set
    Subbalakshmi, Chatti
    Krishna, G. Rama
    Rao, S. Krishna Mohan
    Rao, P. Venketeswa
    [J]. PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON INFORMATION AND COMMUNICATION TECHNOLOGIES, ICICT 2014, 2015, 46 : 346 - 353
  • [33] A hierarchical Gamma Mixture Model-based method for estimating the number of clusters in complex data
    Azhar, Muhammad
    Huang, Joshua Zhexue
    Masud, Md Abdul
    Li, Mark Junjie
    Cui, Laizhong
    [J]. APPLIED SOFT COMPUTING, 2020, 87
  • [34] Evaluation of the number of clusters in a data set using p-values from multiple tests of hypotheses
    Modak, Soumita
    [J]. COMMUNICATIONS IN STATISTICS-THEORY AND METHODS, 2024,
  • [35] Estimating the Baseline Error of Wide-Swath Altimeters Using Nadir Altimeters via Numerical Simulation
    MIAO Xiangying
    JIA Yongjun
    LIN Mingsen
    MIAO Hongli
    [J]. Journal of Ocean University of China, 2022, 21 (03) : 681 - 693
  • [36] Estimating the Baseline Error of Wide-Swath Altimeters Using Nadir Altimeters via Numerical Simulation
    Miao Xiangying
    Jia Yongjun
    Lin Mingsen
    Miao Hongli
    [J]. JOURNAL OF OCEAN UNIVERSITY OF CHINA, 2022, 21 (03) : 681 - 693
  • [37] Estimating the Baseline Error of Wide-Swath Altimeters Using Nadir Altimeters via Numerical Simulation
    Xiangying Miao
    Yongjun Jia
    Mingsen Lin
    Hongli Miao
    [J]. Journal of Ocean University of China, 2022, 21 : 681 - 693
  • [38] CytoSet: Predicting clinical outcomes via set-modeling of cytometry data
    Yi, Haidong
    Stanley, Natalie
    [J]. 12TH ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY, AND HEALTH INFORMATICS (ACM-BCB 2021), 2021,
  • [39] Data-driven selection of the number of change-points via error rate control
    Chen, Hui
    Ren, Haojie
    Yao, Fang
    Zou, Changliang
    [J]. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2023, 118 (542) : 1415 - 1428
  • [40] Estimating Aqueous Nanofluids Viscosity via GEP Modeling: Correlation Development and Data Assessment
    Mahdaviara, Mehdi
    Rostami, Alireza
    Shahbazi, Khalil
    Shokrollahi, Amin
    Ghazanfari, Mohammad Hossein
    [J]. IRANIAN JOURNAL OF CHEMISTRY & CHEMICAL ENGINEERING-INTERNATIONAL ENGLISH EDITION, 2022, 41 (01) : 266 - 283