Some remarks on the R² for clustering

Cited by: 4
Authors
Loperfido, Nicola [1]
Tarpey, Thaddeus [2]
Affiliations
[1] Univ Urbino Carlo Bo, Dipartimento Econ Soci & Polit, Urbino, Italy
[2] Wright State Univ, Dept Math & Stat, 3640 Colonel Glenn Hwy, Dayton, OH 45435 USA
Keywords
high-dimensional data; k-means clustering; multiple regression; skewness; IMPROVED APPROXIMATION; PRINCIPAL POINTS; DATA SET; ALGORITHM; SELECTION; NUMBER; SUM; VARIABLES; SKEWNESS; CRITERIA;
DOI
10.1002/sam.11378
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
A common descriptive statistic in cluster analysis is the R², which measures the overall proportion of variance explained by the cluster means. This note highlights properties of the R² for clustering. In particular, we show that the R² can generally be artificially inflated by linear transformations of the data, namely stretching and projecting. The R² for clustering will also often be a poor measure of clustering quality in high-dimensional settings. We further investigate the R² for clustering under misspecified models. Several simulation illustrations are provided that highlight weaknesses of the clustering R², especially in high-dimensional settings. A functional data example is given showing that the R² for clustering can vary dramatically depending on how the curves are estimated.
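The stretching effect described in the abstract can be illustrated with a minimal sketch. This is not the authors' code. It assumes the usual definition R² = 1 − WSS/TSS (within-cluster over total sum of squares), uses a plain NumPy implementation of Lloyd's k-means, and all function names (`kmeans`, `within_ss`, `clustering_r2`) and simulation settings are hypothetical choices made for this example.

```python
import numpy as np

def within_ss(X, labels):
    """Sum of squared distances of each point to its own cluster mean."""
    return sum(
        ((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
        for j in np.unique(labels)
    )

def clustering_r2(X, labels):
    """Assumed definition: R^2 = 1 - WSS / TSS."""
    tss = ((X - X.mean(axis=0)) ** 2).sum()
    return 1.0 - within_ss(X, labels) / tss

def kmeans(X, k, n_init=5, n_iter=100, seed=0):
    """Plain Lloyd's algorithm with random restarts; returns the best labels."""
    rng = np.random.default_rng(seed)
    best_wss, best_labels = np.inf, None
    for _ in range(n_init):
        # start from k distinct data points chosen at random
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # squared distance from every point to every center
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # recompute means; keep the old center if a cluster is empty
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        wss = within_ss(X, labels)
        if wss < best_wss:
            best_wss, best_labels = wss, labels
    return best_labels

# One spherical Gaussian: there is no cluster structure to find.
rng = np.random.default_rng(42)
X = rng.standard_normal((2000, 2))

# Linear transformation: stretch the first coordinate by a factor of 10.
X_stretched = X * np.array([10.0, 1.0])

r2_original = clustering_r2(X, kmeans(X, 2))
r2_stretched = clustering_r2(X_stretched, kmeans(X_stretched, 2))
print(f"R^2 original:  {r2_original:.3f}")
print(f"R^2 stretched: {r2_stretched:.3f}")
```

In this sketch the data contain no true clusters in either case, yet the stretched version yields a noticeably larger R², because most of the total variance now lies along the single direction that the two cluster means split.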
Pages: 135-148
Page count: 14