Multiple outliers detection in sparse high-dimensional regression

Cited by: 13
Authors
Wang, Tao [1 ,2 ,3 ]
Li, Qun [4 ]
Chen, Bin [5 ]
Li, Zhonghua [1 ,2 ]
Affiliations
[1] Nankai Univ, Inst Stat, Tianjin 300071, Peoples R China
[2] Nankai Univ, LPMC, Tianjin 300071, Peoples R China
[3] Huaiyin Normal Univ, Sch Math Sci, Huaian, Peoples R China
[4] Nankai Univ, Sch Math Sci, Tianjin, Peoples R China
[5] Jiangsu Normal Univ, Sch Math & Stat, Xuzhou, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
High-dimensional linear regression; least trimmed square; multiple hypothesis testing; multiple outliers detection; 62J20; 62H15; 62J05; 62F35; HIGH BREAKDOWN-POINT; LARGE DATA SETS; SQUARES REGRESSION; LINEAR-REGRESSION; INFLUENTIAL OBSERVATIONS; IDENTIFICATION; SCALE;
DOI
10.1080/00949655.2017.1379521
Chinese Library Classification
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
The presence of outliers inevitably leads to distorted analysis and inappropriate prediction, especially for multiple outliers in high-dimensional regression, where the high dimensionality of the data might amplify the chance of one or more observations being outlying. Noting that outlier detection is both necessary and important in high-dimensional regression analysis, we propose a feasible outlier detection approach for the sparse high-dimensional linear regression model. First, we search for a clean subset using the sure independence screening method and least trimmed squares regression estimates. Then, we define a high-dimensional outlier detection measure and propose a multiple outliers detection approach based on multiple testing procedures. In addition, to enhance efficiency, we refine the outlier detection rule after obtaining a relatively reliable non-outlier subset from the initial detection approach. Comparison studies based on Monte Carlo simulation show that the proposed method performs well for detecting multiple outliers in the sparse high-dimensional linear regression model. We further illustrate the application of the proposed method through an empirical analysis of a real-life protein and gene expression data set.
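The pipeline outlined in the abstract (screening, a clean subset from least trimmed squares, then multiple testing) can be illustrated with a rough Python sketch. This is not the authors' procedure: the abstract does not specify their detection measure or testing rule, so the sketch assumes standardized residuals scaled by the MAD, two-sided normal p-values, and the Benjamini-Hochberg step-up rule. All function names and parameters (`sis_screen`, `lts_clean_subset`, `detect_outliers`, `d`, `h`, `alpha`) are hypothetical.

```python
import math
import numpy as np


def sis_screen(X, y, d):
    """Screening step: keep the d predictors with the largest
    absolute marginal correlation with y."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()
    corr = np.abs(Xs.T @ ys) / len(y)
    return np.sort(np.argsort(corr)[::-1][:d])


def lts_clean_subset(X, y, h, n_starts=50, n_csteps=20, seed=0):
    """Approximate least trimmed squares via concentration steps:
    refit OLS on the h observations with the smallest squared residuals."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Z = np.column_stack([np.ones(n), X])  # add an intercept column
    best_obj, best_idx = np.inf, None
    for _ in range(n_starts):
        idx = rng.choice(n, size=p + 1, replace=False)
        for _ in range(n_csteps):
            beta, *_ = np.linalg.lstsq(Z[idx], y[idx], rcond=None)
            r2 = (y - Z @ beta) ** 2
            new_idx = np.sort(np.argsort(r2)[:h])
            if np.array_equal(new_idx, idx):
                break
            idx = new_idx
        obj = np.sort(r2)[:h].sum()  # trimmed sum of squares
        if obj < best_obj:
            best_obj, best_idx = obj, np.sort(np.argsort(r2)[:h])
    return best_idx


def benjamini_hochberg(pvals, alpha):
    """Indices rejected by the Benjamini-Hochberg step-up rule."""
    n = len(pvals)
    order = np.argsort(pvals)
    passed = pvals[order] <= alpha * np.arange(1, n + 1) / n
    if not passed.any():
        return np.array([], dtype=int)
    k = np.max(np.nonzero(passed)[0])
    return np.sort(order[: k + 1])


def detect_outliers(X, y, d, h, alpha=0.05):
    """Screen, fit on a clean subset, then test every residual.
    The residual-based measure below is an assumption of this sketch."""
    keep = sis_screen(X, y, d)
    clean = lts_clean_subset(X[:, keep], y, h)
    Z = np.column_stack([np.ones(len(y)), X[:, keep]])
    beta, *_ = np.linalg.lstsq(Z[clean], y[clean], rcond=None)
    resid = y - Z @ beta
    # Robust scale: MAD with the usual normal consistency factor.
    sigma = 1.4826 * np.median(np.abs(resid - np.median(resid)))
    z = resid / sigma
    pvals = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return benjamini_hochberg(pvals, alpha)
```

Here `d` is the number of predictors retained by screening and `h` is the size of the clean subset; both are tuning choices of the sketch and would need to be set for the data at hand. The refinement step mentioned in the abstract is not reproduced.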
Pages: 89 - 107
Number of pages: 19
Related papers
50 in total
  • [41] Sequential Analysis in High-Dimensional Multiple Testing and Sparse Recovery
    Malloy, Matthew
    Nowak, Robert
    [J]. 2011 IEEE INTERNATIONAL SYMPOSIUM ON INFORMATION THEORY PROCEEDINGS (ISIT), 2011 : 2661 - 2665
  • [42] High-dimensional sparse MANOVA
    Cai, T. Tony
    Xia, Yin
    [J]. JOURNAL OF MULTIVARIATE ANALYSIS, 2014, 131 : 174 - 196
  • [43] Detecting and ranking outliers in high-dimensional data
    Kaur, Amardeep
    Datta, Amitava
    [J]. INTERNATIONAL JOURNAL OF ADVANCES IN ENGINEERING SCIENCES AND APPLIED MATHEMATICS, 2019, 11 (01) : 75 - 87
  • [44] Hiding outliers in high-dimensional data spaces
    Steinbuss G.
    Böhm K.
    [J]. International Journal of Data Science and Analytics, 2017, 4 (3) : 173 - 189
  • [45] MINIMAX RATES IN SPARSE, HIGH-DIMENSIONAL CHANGE POINT DETECTION
    Liu, Haoyang
    Gao, Chao
    Samworth, Richard J.
    [J]. ANNALS OF STATISTICS, 2021, 49 (02) : 1081 - 1112
  • [46] Hyperspherical Sparse Approximation Techniques for High-Dimensional Discontinuity Detection
    Zhang, Guannan
    Webster, Clayton G.
    Gunzburger, Max
    Burkardt, John
    [J]. SIAM REVIEW, 2016, 58 (03) : 517 - 551
  • [47] High-Dimensional Interactions Detection with Sparse Principal Hessian Matrix
    Tang, Cheng Yong
    Fang, Ethan X.
    Dong, Yuexiao
    [J]. JOURNAL OF MACHINE LEARNING RESEARCH, 2020, 21
  • [48] High-dimensional interactions detection with sparse principal hessian matrix
    Tang, Cheng Yong
    Fang, Ethan X.
    Dong, Yuexiao
    [J]. Journal of Machine Learning Research, 2020, 21
  • [49] Detecting and ranking outliers in high-dimensional data
    Amardeep Kaur
    Amitava Datta
    [J]. International Journal of Advances in Engineering Sciences and Applied Mathematics, 2019, 11 : 75 - 87
  • [50] Sparse least trimmed squares regression with compositional covariates for high-dimensional data
    Monti, Gianna Serafina
    Filzmoser, Peter
    [J]. BIOINFORMATICS, 2021, 37 (21) : 3805 - 3814