Fast Robust Model Selection in Large Datasets

Cited by: 9
Authors
Dupuis, Debbie J. [1]
Victoria-Feser, Maria-Pia [2,3]
Affiliations
[1] HEC Montreal, Dept Management Sci, Montreal, PQ H3T 2A7, Canada
[2] Univ Geneva, Res Ctr Stat, CH-1211 Geneva, Switzerland
[3] Univ Geneva, HEC Geneve, CH-1211 Geneva, Switzerland
Funding
Natural Sciences and Engineering Research Council of Canada; Swiss National Science Foundation
Keywords
False discovery rate; Least angle regression; Linear regression; M-estimator; Multicollinearity; Partial correlation; Random forests; Robust t test; Linear model; Variable selection
DOI
10.1198/jasa.2011.tm09650
Chinese Library Classification
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Subject Classification Codes
020208; 070103; 0714
Abstract
Large datasets are increasingly common in many research fields. In particular, in the linear regression context, it is often the case that a huge number of potential covariates are available to explain a response variable, and the first step of a reasonable statistical analysis is to reduce the number of covariates. This can be done in a forward selection procedure that includes selecting the variable to enter, deciding to retain it or stop the selection, and estimating the augmented model. Least squares plus t tests can be fast, but the outcome of a forward selection might be suboptimal when there are outliers. In this article we propose a complete algorithm for fast robust model selection, including considerations for huge sample sizes. Because simply replacing the classical statistical criteria with robust ones is not computationally possible, we develop simplified robust estimators, selection criteria, and testing procedures for linear regression. The robust estimator is a one-step weighted M-estimator that can be biased if the covariates are not orthogonal. We show that the bias can be made smaller by iterating the M-estimator one or more steps further. In the variable selection process, we propose a simplified robust criterion based on a robust t statistic that we compare with a false discovery rate-adjusted level. We carry out a simulation study to show the good performance of our approach. We also analyze two datasets and show that the results obtained by our method outperform those from robust least angle regression and random forests. Supplemental materials are available online.
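The abstract describes the general recipe of the procedure: a weighted M-type fit computed inside a forward selection loop, with a robust t statistic compared against a false discovery rate-adjusted level. The sketch below illustrates that recipe only and is not the authors' algorithm; the Huber weighting, the single reweighting step, and the per-step level alpha * j / p are illustrative assumptions, and the helper names (huber_weights, weighted_ls, forward_select_robust) are hypothetical.

```python
# Minimal sketch of robust forward selection with an FDR-style stopping rule.
# All tuning choices here are generic stand-ins, not the authors' method.
import numpy as np
from scipy import stats


def huber_weights(resid, scale, c=1.345):
    """Huber weights for residuals standardized by a robust scale."""
    r = resid / scale
    w = np.ones_like(r)
    big = np.abs(r) > c
    w[big] = c / np.abs(r[big])
    return w


def weighted_ls(X, y, w):
    """Weighted least squares; returns coefficients, covariance, residuals."""
    Xw = X * w[:, None]
    XtX = X.T @ Xw
    beta = np.linalg.solve(XtX, Xw.T @ y)
    resid = y - X @ beta
    sigma2 = np.sum(w * resid**2) / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(XtX)
    return beta, cov, resid


def forward_select_robust(X, y, alpha=0.05, max_vars=None):
    """Forward selection using a robust t statistic compared with an
    FDR-adjusted per-step level alpha * j / p (illustrative choice)."""
    n, p = X.shape
    max_vars = max_vars or p
    selected, remaining = [], list(range(p))
    while remaining and len(selected) < max_vars:
        best = None
        for j in remaining:
            cols = selected + [j]
            Xc = np.column_stack([np.ones(n), X[:, cols]])
            # initial LS fit, then one Huber reweighting step (one-step M-type)
            beta, cov, resid = weighted_ls(Xc, y, np.ones(n))
            scale = max(1.4826 * np.median(np.abs(resid - np.median(resid))), 1e-8)
            w = huber_weights(resid, scale)
            beta, cov, resid = weighted_ls(Xc, y, w)
            t = abs(beta[-1]) / np.sqrt(cov[-1, -1])
            if best is None or t > best[1]:
                best = (j, t)
        j, t = best
        step = len(selected) + 1
        alpha_step = alpha * step / p                    # FDR-adjusted level (sketch)
        crit = stats.t.ppf(1 - alpha_step / 2, n - step - 1)
        if t < crit:
            break                                        # no candidate passes: stop
        selected.append(j)
        remaining.remove(j)
    return selected
```

In the article, the simplified estimator, selection criterion, and testing procedure are designed specifically so that each forward step stays cheap for very large n and p; the sketch substitutes generic choices for those components.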
Pages: 203-212
Page count: 10