High-dimensional regression with potential prior information on variable importance

Times Cited: 0
Authors
Stokell, Benjamin G. [1 ]
Shah, Rajen D. [1 ]
Affiliations
[1] Univ Cambridge, Cambridge, England
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK
Keywords
High-dimensional data; Low variance filter; Lasso; Ridge regression; Missing data; Corrupted data; SELECTION; LASSO;
DOI
10.1007/s11222-022-10110-5
CLC Number
TP301 [Theory and Methods]
Discipline Code
081202
Abstract
There are a variety of settings where vague prior information may be available on the importance of predictors in high-dimensional regression settings. Examples include the ordering on the variables offered by their empirical variances (which is typically discarded through standardisation), the lag of predictors when fitting autoregressive models in time series settings, or the level of missingness of the variables. Whilst such orderings may not match the true importance of variables, we argue that there is little to be lost, and potentially much to be gained, by using them. We propose a simple scheme involving fitting a sequence of models indicated by the ordering. We show that the computational cost for fitting all models when ridge regression is used is no more than for a single fit of ridge regression, and describe a strategy for Lasso regression that makes use of previous fits to greatly speed up fitting the entire sequence of models. We propose to select a final estimator by cross-validation and provide a general result on the quality of the best performing estimator on a test set selected from among a number M of competing estimators in a high-dimensional linear regression setting. Our result requires no sparsity assumptions and shows that only a log M price is incurred compared to the unknown best estimator. We demonstrate the effectiveness of our approach when applied to missing or corrupted data, and in time series settings. An R package is available on GitHub.
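The nested-model scheme the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and toy data are invented, ridge is solved naively via the normal equations (rather than with the single-fit trick the paper exploits), and a single validation split stands in for cross-validation.

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting (tiny systems only).
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ridge(X, y, lam):
    # beta = (X'X + lam*I)^{-1} X'y via the normal equations; fine for small p.
    n, p = len(X), len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) + (lam if a == b else 0.0)
            for b in range(p)] for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    return solve(XtX, Xty)

def nested_ridge(Xtr, ytr, Xval, yval, order, lam=1e-6):
    # Fit ridge on growing prefixes of `order` (the prior importance ranking)
    # and return (val_mse, k, beta) for the best-performing prefix length k.
    best = None
    for k in range(1, len(order) + 1):
        idx = order[:k]
        Xk = [[row[j] for j in idx] for row in Xtr]
        beta = ridge(Xk, ytr, lam)
        mse = sum((sum(b * row[j] for b, j in zip(beta, idx)) - t) ** 2
                  for row, t in zip(Xval, yval)) / len(yval)
        if best is None or mse < best[0]:
            best = (mse, k, beta)
    return best

# Toy data: the response depends only on the first-ranked variable (y = 2*x0),
# so a short prefix of the ordering should already predict well.
Xtr = [[1.0, 0.3], [2.0, -0.5], [3.0, 0.1], [4.0, 0.8]]
ytr = [2.0, 4.0, 6.0, 8.0]
Xval = [[1.5, 0.2], [2.5, -0.1]]
yval = [3.0, 5.0]
mse, k, beta = nested_ridge(Xtr, ytr, Xval, yval, order=[0, 1])
```

The point of the paper is that this loop need not refit from scratch: for ridge, all prefix fits cost no more than a single fit, and for the Lasso, earlier fits warm-start later ones; the sketch above refits each prefix independently for clarity.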
Pages: 12