Partition clustering of high dimensional low sample size data based on p-values

Cited by: 7
Authors
von Borries, George [2 ]
Wang, Haiyan [1 ]
Affiliations
[1] Kansas State Univ, Dept Stat, Manhattan, KS 66506 USA
[2] Univ Brasilia, Dept Estat, IE, BR-70910900 Brasilia, DF, Brazil
Keywords
FALSE DISCOVERY RATE; VARIANCE; NUMBER; ANOVA;
DOI
10.1016/j.csda.2009.06.012
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
Clustering techniques play an important role in analyzing the high dimensional data that are common in high-throughput screening, such as microarray and mass spectrometry data. Effective use of the high dimensionality and of a few replications can help to increase clustering accuracy and stability. In this article, a new partitioning algorithm with a robust distance measure is introduced to cluster variables in high dimensional low sample size (HDLSS) data that contain a large number of independent variables with a small number of replications per variable. The proposed clustering algorithm, PPCLUST, treats the data as coming from a mixture distribution and uses p-values from nonparametric rank tests of homogeneous distribution as a measure of similarity to separate the mixture components. PPCLUST is able to efficiently cluster a large number of variables in the presence of very few replications. Because it inherits the robustness of rank procedures, the new algorithm is robust to outliers and invariant to monotone transformations of the data. Numerical studies and an application to microarray gene expression data from a colorectal cancer study are discussed. Published by Elsevier B.V.
Pages: 3987-3998
Page count: 12