Robust and sparse correlation matrix estimation for the analysis of high-dimensional genomics data

Cited by: 18
Authors
Serra, Angela [1]
Coretto, Pietro [2]
Fratello, Michele [3]
Tagliaferri, Roberto [1]
Affiliations
[1] Univ Salerno, Dept Management & Innovat Syst, NeuRoNeLab, I-84084 Fisciano, Sa, Italy
[2] Univ Salerno, Dept Econ & Stat, STATLAB, I-84084 Fisciano, Sa, Italy
[3] Second Univ Napoli, Dept Med Surg Neurol Metab & Ageing Sci, Piazza Luigi Miraglia 2, I-80138 Naples, Italy
Keywords
GENE-EXPRESSION DATA; CLUSTER-ANALYSIS; COVARIANCE; SELECTION; NUMBER; NOISE;
DOI
10.1093/bioinformatics/btx642
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Motivation: Microarray technology can be used to study the expression of thousands of genes across a number of different experimental conditions, usually hundreds. The underlying principle is that genes sharing similar expression patterns across different samples can be part of the same co-expression system, or they may share the same biological functions. Groups of genes are usually identified by cluster analysis, and clustering methods rely on a similarity matrix between genes. A common way to measure similarity is to compute the sample correlation matrix. Dimensionality reduction is another popular data analysis task that is also based on covariance/correlation matrix estimates. Unfortunately, covariance/correlation matrix estimation suffers from the intrinsic noise present in high-dimensional data. Sources of noise are sampling variation, the presence of outlying sample units, and the fact that in most cases the number of units is much smaller than the number of genes.
Results: In this paper, we propose a robust correlation matrix estimator that is regularized based on adaptive thresholding. The resulting method jointly tames the effects of high dimensionality and data contamination. Computations are easy to implement and do not require hand tuning. Both simulated and real data are analyzed. A Monte Carlo experiment shows that the proposed method achieves remarkable performance. Our correlation metric is more robust to outliers than the existing alternatives in two gene expression datasets. It is also shown how the regularization makes it possible to automatically detect and filter spurious correlations. The same regularization is also extended to other, less robust correlation measures. Finally, we apply the ARACNE algorithm to the SyNTreN gene expression data. The sensitivity and specificity of the reconstructed network are compared with the gold standard. We show that ARACNE performs better when it takes the proposed correlation matrix estimator as input.
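The general idea of combining a robust correlation measure with thresholding can be sketched in a few lines. This is only a minimal illustration, not the estimator proposed in the paper: it assumes a Spearman-type rank correlation as the robust measure and a fixed hard threshold proportional to sqrt(log(p)/n), whereas the abstract describes an adaptive threshold that needs no hand tuning. The function names, the constant 2.0, and the simulated data are all hypothetical.

```python
import numpy as np

def rank_correlation(x):
    """Spearman-type correlation between the columns of x (n samples by p genes)."""
    # A double argsort yields ranks when a column has no ties (continuous data).
    ranks = np.argsort(np.argsort(x, axis=0), axis=0).astype(float)
    return np.corrcoef(ranks, rowvar=False)

def hard_threshold(corr, t):
    """Zero out entries with absolute value <= t, then restore the unit diagonal."""
    out = np.where(np.abs(corr) > t, corr, 0.0)
    np.fill_diagonal(out, 1.0)
    return out

# Simulated data (assumption): few samples, many genes, one contaminated unit.
rng = np.random.default_rng(0)
n, p = 50, 200
x = rng.standard_normal((n, p))
x[:, 1] = x[:, 0] + 0.1 * rng.standard_normal(n)  # one truly co-expressed pair
x[0, :2] = [50.0, -50.0]                          # gross outlier in that pair

# Hypothetical fixed threshold; the paper selects its threshold adaptively.
t = 2.0 * np.sqrt(np.log(p) / n)
sparse_corr = hard_threshold(rank_correlation(x), t)

off_diag = sparse_corr[~np.eye(p, dtype=bool)]
print("threshold:", round(float(t), 3))
print("kept correlation for genes 0 and 1:", round(float(sparse_corr[0, 1]), 3))
print("fraction of nonzero off-diagonal entries:", float(np.mean(off_diag != 0)))
```

In this sketch the rank transform limits the influence of the single outlying unit, and the threshold removes most of the small correlations that arise by chance when p is much larger than n.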
Pages: 625-634
Number of pages: 10