Machine-learned cluster identification in high-dimensional data

Cited by: 56
Authors
Ultsch, Alfred [1]
Loetsch, Joern [2, 3]
Affiliations
[1] Univ Marburg, DataBion Res Grp, Hans Meerwein Str, D-35032 Marburg, Germany
[2] Goethe Univ, Inst Clin Pharmacol, Theodor Stern Kai 7, D-60590 Frankfurt, Germany
[3] Fraunhofer Inst Mol Biol & Appl Ecol, Project Grp Translat Med & Pharmacol IME TMP, Theodor Stern Kai 7, D-60590 Frankfurt, Germany
Keywords
Machine-learning; Clustering; SELF-ORGANIZING MAPS; GENE-EXPRESSION; DISCOVERY; CANCER;
DOI
10.1016/j.jbi.2016.12.011
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Background: High-dimensional biomedical data are frequently clustered to identify subgroup structures pointing at distinct disease subtypes. It is crucial that the clustering algorithm used works correctly. However, by imposing a predefined shape on the clusters, classical algorithms occasionally suggest a cluster structure in homogeneously distributed data or assign data points to incorrect clusters. We analyzed whether this can be avoided by using emergent self-organizing feature maps (ESOM).
Methods: Data sets with different degrees of complexity were submitted to ESOM analysis with large numbers of neurons, using an interactive R-based bioinformatics tool. On top of the trained ESOM, the distance structure in the high-dimensional feature space was visualized in the form of a so-called U-matrix. Clustering results were compared with those provided by common classical clustering algorithms including single linkage, Ward and k-means.
Results: Ward clustering imposed cluster structures on cluster-less "golf ball", "cuboid" and "S-shaped" data sets that contained no structure at all (random data). Ward clustering also imposed structures on permuted real-world data sets. By contrast, the ESOM/U-matrix approach correctly found that these data contain no cluster structure. At the same time, ESOM/U-matrix correctly identified clusters in biomedical data that truly contained subgroups. It was always correct in cluster structure identification in further canonical artificial data. Using intentionally simple data sets, it is shown that popular clustering algorithms typically used for biomedical data sets may fail to cluster data correctly, suggesting that they are also likely to perform erroneously on high-dimensional biomedical data.
Conclusions: The present analyses emphasized that generally established classical hierarchical clustering algorithms carry a considerable tendency to produce erroneous results. By contrast, unsupervised machine-learned analysis of cluster structures, applied using the ESOM/U-matrix method, is a viable, unbiased method to identify true clusters in the high-dimensional space of complex data. © 2017 The Authors. Published by Elsevier Inc.
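
The U-matrix mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' R-based tool. It assumes one common definition, in which the U-height of a neuron is the mean Euclidean distance between its prototype vector and those of its direct grid neighbours, and it assumes a planar (non-toroidal) grid. The function name `u_matrix` and the toy grids are hypothetical, and no ESOM training is performed: the prototype vectors are set by hand.

```python
import math


def u_matrix(weights):
    """Return U-heights for a planar grid of prototype vectors.

    weights[r][c] is the prototype vector of the neuron in row r, column c.
    Assumed definition: the U-height of a neuron is the mean Euclidean
    distance to the prototypes of its up/down/left/right grid neighbours.
    """
    rows, cols = len(weights), len(weights[0])
    heights = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                # Planar grid: neighbours outside the border are skipped.
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(weights[r][c], weights[nr][nc]))
            # A 1x1 grid has no neighbours; its height is left at 0.0.
            heights[r][c] = sum(dists) / len(dists) if dists else 0.0
    return heights


# Hand-made toy grid: the left half of the map represents one region of
# feature space, the right half a distant one (hypothetical values).
low, high = (0.0, 0.0), (10.0, 10.0)
toy_weights = [[low, low, high, high] for _ in range(4)]
toy_heights = u_matrix(toy_weights)

# Hand-made structure-less grid: all prototypes are identical.
flat_weights = [[low] * 4 for _ in range(4)]
flat_heights = u_matrix(flat_weights)
```

In the toy grid, the U-heights are zero inside each half and positive only along the column boundary between the two halves, which is the "ridge" a reader would interpret as a cluster border. In the flat grid all heights are zero, so no border appears.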
Pages: 95-104
Number of pages: 10
Related papers
50 in total
  • [1] Minimum standards for evaluating machine-learned models of high-dimensional data
    Chen, Brian H.
    [J]. FRONTIERS IN AGING, 2022, 3
  • [2] Efficient high-dimensional variational data assimilation with machine-learned reduced-order models
    Maulik, Romit
    Rao, Vishwas
    Wang, Jiali
    Mengaldo, Gianmarco
    Constantinescu, Emil
    Lusch, Bethany
    Balaprakash, Prasanna
    Foster, Ian
    Kotamarthi, Rao
    [J]. GEOSCIENTIFIC MODEL DEVELOPMENT, 2022, 15 (08) : 3433 - 3445
  • [3] Machine-learned pattern identification in olfactory subtest results
    Jörn Lötsch
    Thomas Hummel
    Alfred Ultsch
    [J]. Scientific Reports, 6
  • [4] Machine-learned pattern identification in olfactory subtest results
    Lotsch, Jorn
    Hummel, Thomas
    Ultsch, Alfred
    [J]. SCIENTIFIC REPORTS, 2016, 6
  • [5] Lessons learned in the analysis of high-dimensional data in vaccinomics
    Oberg, Ann L.
    McKinney, Brett A.
    Schaid, Daniel J.
    Pankratz, V. Shane
    Kennedy, Richard B.
    Poland, Gregory A.
    [J]. VACCINE, 2015, 33 (40) : 5262 - 5270
  • [6] Cluster analysis of high-dimensional data: A case study
    Bean, R
    McLachlan, G
    [J]. INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING IDEAL 2005, PROCEEDINGS, 2005, 3578 : 302 - 310
  • [7] A Density Peak Cluster Model of High-Dimensional Data
    Jin, Cong
    Xie, Xi
    Hu, Fei
    [J]. ADVANCES IN SERVICES COMPUTING, 2016, 10065 : 220 - 227
  • [8] Cluster PCA for outliers detection in high-dimensional data
    Stefatos, George
    Ben Hamza, A.
    [J]. 2007 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN AND CYBERNETICS, VOLS 1-8, 2007, : 3961 - 3966
  • [9] A Visual Method for High-Dimensional Data Cluster Exploration
    Zhang, Ke-Bing
    Huang, Mao Lin
    Orgun, Mehmet A.
    Nguyen, Quang Vinh
    [J]. NEURAL INFORMATION PROCESSING, PT 2, PROCEEDINGS, 2009, 5864 : 699 - +
  • [10] Robust regularized cluster analysis for high-dimensional data
    Kalina, Jan
    Vlckova, Katarina
    [J]. MATHEMATICAL METHODS IN ECONOMICS (MME 2014), 2014, : 378 - 383