Fast Robust Correlation for High-Dimensional Data

Cited by: 26
Authors
Raymaekers, Jakob [1]
Rousseeuw, Peter J. [1]
Affiliations
[1] Katholieke Univ Leuven, Dept Math, Leuven, Belgium
Keywords
Anomaly detection; Cellwise outliers; Covariance matrix; Data transformation; Distance correlation; LOCATION; ALGORITHM
DOI
10.1080/00401706.2019.1677270
Chinese Library Classification (CLC) number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract
The product moment covariance matrix is a cornerstone of multivariate data analysis, from which one can derive correlations, principal components, Mahalanobis distances and many other results. Unfortunately, the product moment covariance and the corresponding Pearson correlation are very susceptible to outliers (anomalies) in the data. Several robust estimators of covariance matrices have been developed, but few are suitable for the ultrahigh-dimensional data that are becoming more prevalent nowadays. For that one needs methods whose computation scales well with the dimension, are guaranteed to yield a positive semidefinite matrix, and are sufficiently robust to outliers as well as sufficiently accurate in the statistical sense of low variability. We construct such methods using data transformations. The resulting approach is simple, fast, and widely applicable. We study its robustness by deriving influence functions and breakdown values, and computing the mean squared error on contaminated data. Using these results we select a method that performs well overall. This also allows us to construct a faster version of the DetectDeviatingCells method (Rousseeuw and Van den Bossche 2018) to detect cellwise outliers, which can deal with much higher dimensions. The approach is illustrated on genomic data with 12,600 variables and color video data with 920,000 dimensions. Supplementary materials for this article are available online.
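The idea of obtaining a robust correlation matrix through a data transformation can be illustrated with a minimal sketch. It assumes a simple transformation chosen only for illustration: each column is standardized with its median and MAD and then clipped at a cutoff b, after which the ordinary Pearson correlation is computed. This is not necessarily the transformation the authors select; the function names, the cutoff b = 1.5, and the simulated data are all hypothetical. Because the result is a Pearson correlation matrix of the transformed data, it is positive semidefinite by construction, and the cost grows with the dimension in the same way as the classical correlation matrix.

```python
import numpy as np


def robust_standardize(X):
    """Center each column by its median and scale by its MAD (assumed nonzero)."""
    med = np.median(X, axis=0)
    # 1.4826 makes the MAD consistent for the standard deviation at the normal model.
    mad = 1.4826 * np.median(np.abs(X - med), axis=0)
    return (X - med) / mad


def transformed_correlation(X, b=1.5):
    """Pearson correlation of robustly standardized, clipped columns.

    The clipping transformation and the cutoff b are illustrative assumptions.
    """
    Z = np.clip(robust_standardize(X), -b, b)
    return np.corrcoef(Z, rowvar=False)


# Simulated data: columns 0 and 1 have true correlation 0.8.
rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.standard_normal((n, d))
X[:, 1] = 0.8 * X[:, 0] + 0.6 * X[:, 1]

# Contaminate 2% of the rows with outliers that pull the correlation negative.
X[:10, 0] = 50.0
X[:10, 1] = -50.0

pearson = np.corrcoef(X, rowvar=False)
robust = transformed_correlation(X)
eig_min = np.linalg.eigvalsh(robust).min()

print("Pearson corr(0, 1):    ", pearson[0, 1])
print("Transformed corr(0, 1):", robust[0, 1])
print("Smallest eigenvalue:   ", eig_min)
```

In this sketch the classical estimate for columns 0 and 1 is dominated by the ten contaminated rows, while the transformed estimate stays positive. The smallest eigenvalue is printed to check positive semidefiniteness numerically.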
Pages: 184 - 198
Page count: 15
Related papers
50 in total
  • [1] Robust PCA for high-dimensional data
    Hubert, M
    Rousseeuw, PJ
    Verboven, S
    [J]. DEVELOPMENTS IN ROBUST STATISTICS, 2003: 169 - 179
  • [2] Robust and sparse correlation matrix estimation for the analysis of high-dimensional genomics data
    Serra, Angela
    Coretto, Pietro
    Fratello, Michele
    Tagliaferri, Roberto
    [J]. BIOINFORMATICS, 2018, 34 (04) : 625 - 634
  • [3] Robust Ridge Regression for High-Dimensional Data
    Maronna, Ricardo A.
    [J]. TECHNOMETRICS, 2011, 53 (01) : 44 - 53
  • [4] RaCoCl: Robust Rank Correlation Based Clustering - An Exploratory Study for High-Dimensional Data
    Krone, Martin
    Klawonn, Frank
    Jayaram, Balasubramaniam
    [J]. 2013 IEEE INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS (FUZZ - IEEE 2013), 2013
  • [5] Fast Robust Model Predictive Control of High-dimensional Systems
    Foguth, Lucas C.
    Paulson, Joel A.
    Braatz, Richard D.
    Raimondo, Davide M.
    [J]. 2015 EUROPEAN CONTROL CONFERENCE (ECC), 2015: 2009 - 2014
  • [6] On Coupling Robust Estimation with Regularization for High-Dimensional Data
    Kalina, Jan
    Hlinka, Jaroslav
    [J]. DATA SCIENCE: INNOVATIVE DEVELOPMENTS IN DATA ANALYSIS AND CLUSTERING, 2017: 15 - 27
  • [7] Robust high-dimensional regression for data with anomalous responses
    Mingyang Ren
    Sanguo Zhang
    Qingzhao Zhang
    [J]. Annals of the Institute of Statistical Mathematics, 2021, 73 : 703 - 736
  • [8] Parallel computation of high-dimensional robust correlation and covariance matrices
    Chilson, James
    Ng, Raymond
    Wagner, Alan
    Zamar, Ruben
    [J]. ALGORITHMICA, 2006, 45 (03) : 403 - 431
  • [9] Robust high-dimensional regression for data with anomalous responses
    Ren, Mingyang
    Zhang, Sanguo
    Zhang, Qingzhao
    [J]. ANNALS OF THE INSTITUTE OF STATISTICAL MATHEMATICS, 2021, 73 (04) : 703 - 736
  • [10] A robust variable screening method for high-dimensional data
    Wang, Tao
    Zheng, Lin
    Li, Zhonghua
    Liu, Haiyang
    [J]. JOURNAL OF APPLIED STATISTICS, 2017, 44 (10) : 1839 - 1855