On Perfect Clustering of High Dimension, Low Sample Size Data

Cited by: 32
Authors
Sarkar, Soham [1]
Ghosh, Anil K. [2]
Affiliations
[1] Ecole Polytech Fed Lausanne, Inst Math, Stn 8, CH-1015 Lausanne, Switzerland
[2] Indian Stat Inst, Theoret Stat & Math Unit, 203 BT Rd, Kolkata 700108, India
Keywords
Clustering algorithms; Indexes; Euclidean distance; Sociology; Statistics; Single photon emission computed tomography; Estimation; Dunn index; hierarchical clustering; high dimensional consistency; k-means clustering; pairwise distances; Rand index; LARGE NUMBERS; DATA SET; LAWS; PCA;
DOI
10.1109/TPAMI.2019.2912599
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Popular clustering algorithms based on usual distance functions (e.g., the Euclidean distance) often suffer in high dimension, low sample size (HDLSS) situations, where concentration of pairwise distances and violation of neighborhood structure have adverse effects on their performance. In this article, we use a new data-driven dissimilarity measure, called MADD, which takes care of these problems. MADD uses the distance concentration phenomenon to its advantage, and as a result, clustering algorithms based on MADD usually perform well for high dimensional data. We establish it using theoretical as well as numerical studies. We also address the problem of estimating the number of clusters. This is a challenging problem in cluster analysis, and several algorithms are available for it. We show that many of these existing algorithms have superior performance in high dimensions when they are constructed using MADD. We also construct a new estimator based on a penalized version of the Dunn index and prove its consistency in the HDLSS asymptotic regime. Several simulated and real data sets are analyzed to demonstrate the usefulness of MADD for cluster analysis of high dimensional data.
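The abstract names MADD without defining it. As a minimal sketch, assuming MADD between two observations is the mean absolute difference of their scaled Euclidean distances to every other observation in the sample, it could be computed as below. The function name `madd_matrix`, the `1/sqrt(d)` scaling, and the toy data are illustrative assumptions and are not taken from this record.

```python
import numpy as np


def madd_matrix(X):
    """Pairwise dissimilarities under an ASSUMED definition of MADD:
    rho(i, j) = mean over k not in {i, j} of |D[i, k] - D[j, k]|,
    where D holds Euclidean distances scaled by 1/sqrt(d)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    if n < 3:
        raise ValueError("at least three observations are needed")

    # Scaled Euclidean distance matrix (scaling by sqrt(d) is an assumption).
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2) / d)

    rho = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # Compare the distance profiles of i and j over all other points.
            others = np.ones(n, dtype=bool)
            others[[i, j]] = False
            rho[i, j] = rho[j, i] = np.abs(D[i, others] - D[j, others]).mean()
    return rho


if __name__ == "__main__":
    # Hypothetical HDLSS toy data: 2 groups of 5 points in 2000 dimensions.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 1.0, size=(5, 2000)),
                   rng.normal(0.0, 2.0, size=(5, 2000))])
    print(np.round(madd_matrix(X), 2))
```

The resulting matrix can be passed to any clustering routine that accepts precomputed dissimilarities, which is one way to read the abstract's statement that existing algorithms are "constructed using MADD".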
Pages: 2257-2272
Number of pages: 16