Stable Variable Selection for High-Dimensional Genomic Data with Strong Correlations

Cited by: 0
Authors
Sarkar R. [1 ]
Manage S. [2 ]
Gao X. [3 ]
Institutions
[1] Department of Mathematics and Statistics, University of North Carolina at Greensboro, 116 Petty Building, PO Box 26170, Greensboro, 27402, NC
[2] Department of Mathematics, Texas A&M University, Blocker Building, 3368 TAMU, 155 Ireland Street, College Station, 77840, TX
[3] Meta Platforms, Menlo Park, CA
Funding
U.S. National Science Foundation
Keywords
Bi-level sparsity; Minimax concave penalty; Stability; Strong correlation; Variable selection
DOI
10.1007/s40745-023-00481-5
Abstract
High-dimensional genomic data often exhibit strong correlations, which cause instability and inconsistency in the estimates obtained with commonly used regularization approaches such as the Lasso and MCP. In this paper, we perform a comparative study of regularization approaches for variable selection under different correlation structures and propose a two-stage procedure, named rPGBS, to achieve stable variable selection in a variety of strong-correlation settings. The approach repeatedly runs a two-stage hierarchical procedure consisting of random pseudo-group clustering followed by bi-level variable selection. Extensive simulation studies and analyses of real high-dimensional genomic datasets demonstrate the advantage of the proposed rPGBS method over some of the most widely used regularization methods. In particular, rPGBS selects variables more stably across a variety of correlation settings than recent methods designed for variable selection under strong correlations: Precision Lasso (Wang et al. in Bioinformatics 35:1181–1187, 2019) and Whitening Lasso (Zhu et al. in Bioinformatics 37:2238–2244, 2021). Moreover, rPGBS is computationally efficient across various settings. © 2023, The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature.
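The abstract's two-stage idea (random pseudo-group clustering, bi-level selection within groups, repeated and aggregated for stability) can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' rPGBS implementation: marginal-correlation screening stands in for the MCP-based bi-level penalty at both levels, and all names and parameters (`rpgbs_sketch`, `n_groups`, `freq_cutoff`, etc.) are hypothetical.

```python
import numpy as np

def rpgbs_sketch(X, y, n_groups=5, n_repeats=50, group_frac=0.4,
                 within_frac=0.5, freq_cutoff=0.5, seed=0):
    """Illustrative sketch of repeated pseudo-group bi-level selection.

    Stand-in for the paper's method: marginal-correlation screening
    replaces the MCP-based penalty at both the group and the
    within-group level; all parameter choices are hypothetical.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    # Marginal correlation of each feature with the response.
    Xc = (X - X.mean(0)) / X.std(0)
    yc = (y - y.mean()) / y.std()
    score = np.abs(Xc.T @ yc) / n
    for _ in range(n_repeats):
        # Stage 1: random pseudo-group clustering of the p features.
        perm = rng.permutation(p)
        groups = np.array_split(perm, n_groups)
        # Group-level selection: keep groups with the largest mean score.
        g_scores = np.array([score[g].mean() for g in groups])
        n_keep = max(1, int(group_frac * n_groups))
        keep = np.argsort(g_scores)[-n_keep:]
        # Stage 2: within-group selection of the strongest features.
        for gi in keep:
            g = groups[gi]
            k = max(1, int(within_frac * len(g)))
            counts[g[np.argsort(score[g])[-k:]]] += 1
    # Aggregate: keep features selected in a majority of repeats.
    return np.flatnonzero(counts / n_repeats >= freq_cutoff)
```

The repetition over random groupings is what drives stability here: a feature's final status depends on its selection frequency across many partitions rather than on a single penalized fit, which is the property the abstract contrasts with one-shot Lasso/MCP estimates under strong correlation.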
Pages: 1139–1164 (25 pages)