Fast Robust Model Selection in Large Datasets

Cited by: 9
Authors
Dupuis, Debbie J. [1]
Victoria-Feser, Maria-Pia [2, 3]
Affiliations
[1] HEC Montreal, Dept Management Sci, Montreal, PQ H3T 2A7, Canada
[2] Univ Geneva, Res Ctr Stat, CH-1211 Geneva, Switzerland
[3] Univ Geneva, HEC Geneve, CH-1211 Geneva, Switzerland
Funding
Natural Sciences and Engineering Research Council of Canada; Swiss National Science Foundation
Keywords
False discovery rate; Least angle regression; Linear regression; M-estimator; Multicollinearity; Partial correlation; Random forests; Robust t test; Linear model; Variable selection
DOI
10.1198/jasa.2011.tm09650
Chinese Library Classification
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Subject classification codes
020208 ; 070103 ; 0714 ;
Abstract
Large datasets are increasingly common in many research fields. In particular, in the linear regression context, it is often the case that a huge number of potential covariates are available to explain a response variable, and the first step of a reasonable statistical analysis is to reduce the number of covariates. This can be done in a forward selection procedure that includes selecting the variable to enter, deciding to retain it or stop the selection, and estimating the augmented model. Least squares plus t tests can be fast, but the outcome of a forward selection might be suboptimal when there are outliers. In this article we propose a complete algorithm for fast robust model selection, including considerations for huge sample sizes. Because simply replacing the classical statistical criteria with robust ones is not computationally possible, we develop simplified robust estimators, selection criteria, and testing procedures for linear regression. The robust estimator is a one-step weighted M-estimator that can be biased if the covariates are not orthogonal. We show that the bias can be made smaller by iterating the M-estimator one or more steps further. In the variable selection process, we propose a simplified robust criterion based on a robust t statistic that we compare with a false discovery rate-adjusted level. We carry out a simulation study to show the good performance of our approach. We also analyze two datasets and show that the results obtained by our method outperform those from robust least angle regression and random forests. Supplemental materials are available online.
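The idea of a one-step weighted M-estimator, and of iterating it further, can be illustrated with a small sketch. This is not the authors' estimator: it assumes a single covariate, Huber weights with the conventional tuning constant 1.345, a least squares starting fit, and a MAD-based residual scale. The function names (`huber_weights`, `weighted_fit`, `m_estimate`) and the toy data are hypothetical and chosen only for illustration.

```python
import statistics

def huber_weights(residuals, c=1.345):
    """Huber weights: 1 for small standardized residuals, c/|r| beyond c."""
    med = statistics.median(residuals)
    mad = statistics.median(abs(r - med) for r in residuals)
    scale = 1.4826 * mad or 1.0  # assumption: fall back to 1.0 if the MAD is zero
    return [1.0 if r == 0 else min(1.0, c * scale / abs(r)) for r in residuals]

def weighted_fit(x, y, w):
    """Weighted least squares for y = a + b*x; returns (a, b)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

def m_estimate(x, y, steps=1):
    """Start from least squares, then reweight and refit `steps` times."""
    a, b = weighted_fit(x, y, [1.0] * len(x))
    for _ in range(steps):
        residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        a, b = weighted_fit(x, y, huber_weights(residuals))
    return a, b

# Toy data: true line y = 2 + 0.5*x with small deterministic noise,
# plus one gross outlier in the last observation.
x = [float(i) for i in range(20)]
y = [2.0 + 0.5 * xi + 0.1 * ((7 * i) % 5 - 2) for i, xi in enumerate(x)]
y[19] = 60.0

_, b_ols = weighted_fit(x, y, [1.0] * len(x))  # plain least squares
_, b1 = m_estimate(x, y, steps=1)              # one reweighting step
_, b5 = m_estimate(x, y, steps=5)              # four further steps

print("OLS slope:", b_ols)
print("one-step slope:", b1)
print("five-step slope:", b5)
```

On this toy example the single outlier pulls the least squares slope far from 0.5, one reweighting step brings it much closer, and further steps tighten it again. The sketch shows only the reweighting mechanism with one covariate; the bias from non-orthogonal covariates discussed in the abstract arises in the multi-covariate setting, which is not reproduced here.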
Pages: 203-212
Page count: 10
Related papers
50 results in total
  • [1] Fast robust variable selection using VIF regression in large datasets
    Seo, Han Son
    [J]. KOREAN JOURNAL OF APPLIED STATISTICS, 2018, 31 (04) : 463 - 473
  • [2] FAST AND ROBUST BOOTSTRAP IN ANALYSING LARGE MULTIVARIATE DATASETS
    Basiri, Shahab
    Ollila, Esa
    Koivunen, Visa
    [J]. CONFERENCE RECORD OF THE 2014 FORTY-EIGHTH ASILOMAR CONFERENCE ON SIGNALS, SYSTEMS & COMPUTERS, 2014, : 8 - 13
  • [3] Robust model selection using fast and robust bootstrap
    Salibian-Barrera, Matias
    Van Aelst, Stefan
    [J]. COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2008, 52 (12) : 5121 - 5135
  • [4] A fast-prediction surrogate model for large datasets
    Hwang, John T.
    Martins, Joaquim R. R. A.
    [J]. AEROSPACE SCIENCE AND TECHNOLOGY, 2018, 75 : 74 - 87
  • [5] Fast model selection for robust calibration methods
    Engelen, S
    Hubert, M
    [J]. ANALYTICA CHIMICA ACTA, 2005, 544 (1-2) : 219 - 228
  • [6] Decision tree induction using a fast splitting attribute selection for large datasets
    Franco-Arcega, A.
    Carrasco-Ochoa, J. A.
    Sanchez-Diaz, G.
    Fco Martinez-Trinidad, J.
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2011, 38 (11) : 14290 - 14300
  • [7] GenoCore: A simple and fast algorithm for core subset selection from large genotype datasets
    Jeong, Seongmun
    Kim, Jae-Yoon
    Jeong, Soon-Chun
    Kang, Sung-Taeg
    Moon, Jung-Kyung
    Kim, Namshin
    [J]. PLOS ONE, 2017, 12 (07):
  • [8] Picube for Fast Exploration of Large Datasets
    Fu, Wenxiao
    [J]. 2020 IEEE 36TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2020), 2020, : 2069 - 2073
  • [9] Using Proximity Graph Cut for Fast and Robust Instance-Based Classification in Large Datasets
    Protasov, Stanislav
    Khan, Adil Mehmood
    [J]. COMPLEXITY, 2021, 2021
  • [10] A FAST MODEL SELECTION PROCEDURE FOR LARGE FAMILIES OF MODELS
    EDWARDS, D
    HAVRANEK, T
    [J]. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1987, 82 (397) : 205 - 213