Stable Variable Selection for High-Dimensional Genomic Data with Strong Correlations

Authors
Sarkar R. [1]
Manage S. [2]
Gao X. [3]
Affiliations
[1] Department of Mathematics and Statistics, University of North Carolina at Greensboro, 116 Petty Building, PO Box 26170, Greensboro, 27402, NC
[2] Department of Mathematics, Texas A&M University, Blocker Building, 3368 TAMU, 155 Ireland Street, College Station, 77840, TX
[3] Meta Platforms, Menlo Park, CA
Funding
U.S. National Science Foundation
Keywords
Bi-level sparsity; Minimax concave penalty; Stability; Strong correlation; Variable selection
DOI
10.1007/s40745-023-00481-5
Abstract
High-dimensional genomic data often exhibit strong correlations, which lead to unstable and inconsistent estimates from commonly used regularization approaches such as the Lasso and MCP. In this paper, we perform a comparative study of regularization approaches for variable selection under different correlation structures and propose a procedure named rPGBS to address the issue of stable variable selection in various strong correlation settings. This approach involves repeatedly running a two-stage hierarchical procedure consisting of random pseudo-group clustering followed by bi-level variable selection. Extensive simulation studies and analyses of real high-dimensional genomic datasets demonstrate the advantage of the proposed rPGBS method over some of the most widely used regularization methods. In particular, rPGBS yields more stable variable selection across a variety of correlation settings than two recent methods that address variable selection under strong correlations: Precision Lasso (Wang et al. in Bioinformatics 35:1181–1187, 2019) and Whitening Lasso (Zhu et al. in Bioinformatics 37:2238–2244, 2021). Moreover, rPGBS is shown to be computationally efficient across various settings. © 2023, The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature.
Pages: 1139 - 1164
Number of pages: 25
Related papers
50 in total
  • [21] Estimation and variable selection for high-dimensional spatial data models
    Hou, Li
    Jin, Baisuo
    Wu, Yuehua
    JOURNAL OF ECONOMETRICS, 2024, 238 (02)
  • [22] Variable selection for longitudinal data with high-dimensional covariates and dropouts
    Zheng, Xueying
    Fu, Bo
    Zhang, Jiajia
    Qin, Guoyou
    JOURNAL OF STATISTICAL COMPUTATION AND SIMULATION, 2018, 88 (04) : 712 - 725
  • [23] Stochastic variational variable selection for high-dimensional microbiome data
    Tung Dang
    Kie Kumaishi
    Erika Usui
    Shungo Kobori
    Takumi Sato
    Yusuke Toda
    Yuji Yamasaki
    Hisashi Tsujimoto
    Yasunori Ichihashi
    Hiroyoshi Iwata
    Microbiome, 10
  • [24] Robust network-based regularization and variable selection for high-dimensional genomic data in cancer prognosis
    Ren, Jie
    Du, Yinhao
    Li, Shaoyu
    Ma, Shuangge
    Jiang, Yu
    Wu, Cen
    GENETIC EPIDEMIOLOGY, 2019, 43 (03) : 276 - 291
  • [25] A reaction norm model for genomic selection using high-dimensional genomic and environmental data
    Jarquin, Diego
    Crossa, Jose
    Lacaze, Xavier
    Du Cheyron, Philippe
    Daucourt, Joelle
    Lorgeou, Josiane
    Piraux, Francis
    Guerreiro, Laurent
    Perez, Paulino
    Calus, Mario
    Burgueno, Juan
    de los Campos, Gustavo
    THEORETICAL AND APPLIED GENETICS, 2014, 127 (03) : 595 - 607
  • [26] A reaction norm model for genomic selection using high-dimensional genomic and environmental data
    Diego Jarquín
    José Crossa
    Xavier Lacaze
    Philippe Du Cheyron
    Joëlle Daucourt
    Josiane Lorgeou
    François Piraux
    Laurent Guerreiro
    Paulino Pérez
    Mario Calus
    Juan Burgueño
    Gustavo de los Campos
    Theoretical and Applied Genetics, 2014, 127 : 595 - 607
  • [27] Precision Lasso: accounting for correlations and linear dependencies in high-dimensional genomic data
    Wang, Haohan
    Lengerich, Benjamin J.
    Aragam, Bryon
    Xing, Eric P.
    BIOINFORMATICS, 2019, 35 (07) : 1181 - 1187
  • [28] PUlasso: High-Dimensional Variable Selection With Presence-Only Data
    Song, Hyebin
    Raskutti, Garvesh
    JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2020, 115 (529) : 334 - 347
  • [29] Variable selection techniques after multiple imputation in high-dimensional data
    Zahid, Faisal Maqbool
    Faisal, Shahla
    Heumann, Christian
    STATISTICAL METHODS AND APPLICATIONS, 2020, 29 (03) : 553 - 580
  • [30] Variable Selection in High-Dimensional Partially Linear Models with Longitudinal Data
    Yang Yiping
    Xue Liugen
    RECENT ADVANCE IN STATISTICS APPLICATION AND RELATED AREAS, VOLS I AND II, 2009 : 661 - 667