Sparse regression for large data sets with outliers

Cited by: 16
Authors
Bottmer, Lea [1, 3]
Croux, Christophe [2]
Wilms, Ines [3]
Affiliations
[1] Stanford Univ, Dept Econ, Stanford, CA 94305 USA
[2] EDHEC Business Sch, Paris, France
[3] Maastricht Univ, Dept Quantitat Econ, Maastricht, Netherlands
Keywords
Data science; Lasso; Outliers; Robust regression; Variable selection; HIGH-DIMENSIONAL DATA; SELECTION; ROBUST; REGULARIZATION; SALES; INFORMATION; MODELS;
DOI
10.1016/j.ejor.2021.05.049
Chinese Library Classification (CLC) number
C93 [Management Science]
Discipline classification code
12; 1201; 1202; 120202
Abstract
The linear regression model remains an important workhorse for data scientists. However, many data sets contain many more predictors than observations. In addition, outliers, or anomalies, frequently occur. This paper proposes an algorithm for regression analysis, called "sparse shooting S", that addresses these features typical of big data sets. The resulting regression coefficients are sparse, meaning that many of them are set to zero, thereby selecting the most relevant predictors. A distinct feature of the method is its robustness to outliers in the cells of the data matrix. A simulation study shows the excellent performance of this robust variable selection and prediction method. A real data application on car fuel consumption demonstrates its usefulness. © 2021 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Pages: 782-794
Page count: 13
Related papers
50 in total
  • [1] Distributed Strategies for Mining Outliers in Large Data Sets
    Angiulli, Fabrizio
    Basta, Stefano
    Lodi, Stefano
    Sartori, Claudio
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2013, 25 (07) : 1520 - 1532
  • [2] A Distributed Approach to Detect Outliers in Very Large Data Sets
    Angiulli, Fabrizio
    Basta, Stefano
    Lodi, Stefano
    Sartori, Claudio
    [J]. EURO-PAR 2010 PARALLEL PROCESSING, PT I, 2010, 6271 : 329 - +
  • [3] Efficient algorithms for mining outliers from large data sets
    Ramaswamy, S
    Rastogi, R
    Shim, K
    [J]. SIGMOD RECORD, 2000, 29 (02) : 427 - 438
  • [4] SPARSE LEAST TRIMMED SQUARES REGRESSION FOR ANALYZING HIGH-DIMENSIONAL LARGE DATA SETS
    Alfons, Andreas
    Croux, Christophe
    Gelper, Sarah
    [J]. ANNALS OF APPLIED STATISTICS, 2013, 7 (01) : 226 - 248
  • [5] ON THE DETECTION OF MULTIVARIATE DATA OUTLIERS AND REGRESSION OUTLIERS
    LAZRAQ, A
    CLEROUX, R
    [J]. DATA ANALYSIS, LEARNING SYMBOLIC AND NUMERIC KNOWLEDGE, 1989, : 133 - 140
  • [6] LOCATION OF SEVERAL OUTLIERS IN MULTIPLE-REGRESSION DATA USING ELEMENTAL SETS
    HAWKINS, DM
    BRADU, D
    KASS, GV
    [J]. TECHNOMETRICS, 1984, 26 (03) : 197 - 208
  • [7] Computing LTS regression for large data sets
    Rousseeuw, PJ
    Van Driessen, K
    [J]. DATA MINING AND KNOWLEDGE DISCOVERY, 2006, 12 (01) : 29 - 45
  • [8] Computing LTS Regression for Large Data Sets
    PETER J. ROUSSEEUW
    KATRIEN VAN DRIESSEN
    [J]. Data Mining and Knowledge Discovery, 2006, 12 : 29 - 45
  • [9] CONFIDENCE SETS IN SPARSE REGRESSION
    Nickl, Richard
    van de Geer, Sara
    [J]. ANNALS OF STATISTICS, 2013, 41 (06) : 2852 - 2876
  • [10] Consistent Estimation for PCA and Sparse Regression with Oblivious Outliers
    d'Orsi, Tommaso
    Liu, Chih-Hung
    Nasser, Rajai
    Novikov, Gleb
    Steurer, David
    Tiegel, Stefan
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34