High-dimensional regression with potential prior information on variable importance

Times Cited: 0
Authors
Stokell, Benjamin G. [1 ]
Shah, Rajen D. [1 ]
Affiliations
[1] Univ Cambridge, Cambridge, England
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK
Keywords
High-dimensional data; Low variance filter; Lasso; Ridge regression; Missing data; Corrupted data; SELECTION; LASSO;
DOI
10.1007/s11222-022-10110-5
CLC Number
TP301 [Theory and Methods]
Discipline Code
081202
Abstract
There are a variety of settings where vague prior information may be available on the importance of predictors in high-dimensional regression settings. Examples include the ordering on the variables offered by their empirical variances (which is typically discarded through standardisation), the lag of predictors when fitting autoregressive models in time series settings, or the level of missingness of the variables. Whilst such orderings may not match the true importance of variables, we argue that there is little to be lost, and potentially much to be gained, by using them. We propose a simple scheme involving fitting a sequence of models indicated by the ordering. We show that the computational cost for fitting all models when ridge regression is used is no more than for a single fit of ridge regression, and describe a strategy for Lasso regression that makes use of previous fits to greatly speed up fitting the entire sequence of models. We propose to select a final estimator by cross-validation and provide a general result on the quality of the best performing estimator on a test set selected from among a number M of competing estimators in a high-dimensional linear regression setting. Our result requires no sparsity assumptions and shows that only a log M price is incurred compared to the unknown best estimator. We demonstrate the effectiveness of our approach when applied to missing or corrupted data, and in time series settings. An R package is available on GitHub.
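The nested-model scheme the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and toy data are invented, ridge is solved naively via the normal equations (rather than with the single-fit trick the paper exploits), and a single validation split stands in for cross-validation.

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting (tiny systems only).
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ridge(X, y, lam):
    # beta = (X'X + lam*I)^{-1} X'y via the normal equations; fine for small p.
    n, p = len(X), len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) + (lam if a == b else 0.0)
            for b in range(p)] for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    return solve(XtX, Xty)

def nested_ridge(Xtr, ytr, Xval, yval, order, lam=1e-6):
    # Fit ridge on growing prefixes of `order` (the prior importance ranking)
    # and return (val_mse, k, beta) for the best-performing prefix length k.
    best = None
    for k in range(1, len(order) + 1):
        idx = order[:k]
        Xk = [[row[j] for j in idx] for row in Xtr]
        beta = ridge(Xk, ytr, lam)
        mse = sum((sum(b * row[j] for b, j in zip(beta, idx)) - t) ** 2
                  for row, t in zip(Xval, yval)) / len(yval)
        if best is None or mse < best[0]:
            best = (mse, k, beta)
    return best

# Toy data: the response depends only on the first-ranked variable (y = 2*x0),
# so a short prefix of the ordering should already predict well.
Xtr = [[1.0, 0.3], [2.0, -0.5], [3.0, 0.1], [4.0, 0.8]]
ytr = [2.0, 4.0, 6.0, 8.0]
Xval = [[1.5, 0.2], [2.5, -0.1]]
yval = [3.0, 5.0]
mse, k, beta = nested_ridge(Xtr, ytr, Xval, yval, order=[0, 1])
```

The point of the paper is that this loop need not refit from scratch: for ridge, all prefix fits cost no more than a single fit, and for the Lasso, earlier fits warm-start later ones; the sketch above refits each prefix independently for clarity.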
Pages: 12