Individual Data Protected Integrative Regression Analysis of High-Dimensional Heterogeneous Data

Cited by: 14
Authors
Cai, Tianxi [1]
Liu, Molei [1]
Xia, Yin [2]
Affiliations
[1] Harvard Univ, Dept Biostat, Harvard Sch Publ Hlth, Boston, MA 02115 USA
[2] Fudan Univ, Sch Management, Dept Stat, Shanghai, Peoples R China
Keywords
DataSHIELD; Distributed learning; High dimensionality; Model heterogeneity; Rate optimality; Sparsistency; CONFIDENCE-INTERVALS; LEVEL DATA; SELECTION; DATASHIELD; MODELS; RATES
DOI
10.1080/01621459.2021.1904958
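Chinese Library Classification (CLC) number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline classification code
020208; 070103; 0714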
Abstract
Evidence-based decision making often relies on meta-analyzing multiple studies, which enables more precise estimation and investigation of generalizability. Integrative analysis of multiple heterogeneous studies is, however, highly challenging in the ultra-high-dimensional setting. The challenge is even more pronounced when the individual-level data cannot be shared across studies, a restriction known as the DataSHIELD constraint. Under sparse regression models that are assumed to be similar yet not identical across studies, we propose in this paper a novel integrative estimation procedure for data-Shielding High-dimensional Integrative Regression (SHIR). SHIR protects individual data through a summary-statistics-based integration procedure, accommodates between-study heterogeneity in both the covariate distribution and model parameters, and attains consistent variable selection. Theoretically, SHIR is statistically more efficient than the existing distributed approaches that integrate debiased LASSO estimators from the local sites. Furthermore, the estimation error incurred by aggregating derived data is negligible compared to the statistical minimax rate, and SHIR is shown to be asymptotically equivalent in estimation to the ideal estimator obtained by sharing all data. The finite-sample performance of our method is studied and compared with existing approaches across extensive simulation settings. We further illustrate the utility of SHIR by deriving phenotyping algorithms for coronary artery disease using electronic health records data from multiple chronic disease cohorts.
Pages: 2105-2119
Number of pages: 15
Related papers
50 in total
  • [1] Integrative analysis of individual-level data and high-dimensional summary statistics
    Fu, Sheng
    Deng, Lu
    Zhang, Han
    Qin, Jing
    Yu, Kai
    [J]. BIOINFORMATICS, 2023, 39 (04)
  • [2] Integrative clustering of high-dimensional data with joint and individual clusters
    Hellton, Kristoffer H.
    Thoresen, Magne
    [J]. BIOSTATISTICS, 2016, 17 (03) : 537 - 548
  • [3] High-dimensional integrative copula discriminant analysis for multiomics data
    He, Yong
    Chen, Hao
    Sun, Hao
    Ji, Jiadong
    Shi, Yufeng
    Zhang, Xinsheng
    Liu, Lei
    [J]. STATISTICS IN MEDICINE, 2020, 39 (30) : 4869 - 4884
  • [4] Factor Analysis Regression for Predictive Modeling with High-Dimensional Data
    Carter, Randy
    Michael, Netsanet
    [J]. JOURNAL OF QUANTITATIVE ECONOMICS, 2022, 20 (SUPPL 1) : 115 - 132
  • [5] High-Dimensional Heteroscedastic Regression with an Application to eQTL Data Analysis
    Daye, Z. John
    Chen, Jinbo
    Li, Hongzhe
    [J]. BIOMETRICS, 2012, 68 (01) : 316 - 326
  • [6] Factor Analysis Regression for Predictive Modeling with High-Dimensional Data
    Randy Carter
    Netsanet Michael
    [J]. Journal of Quantitative Economics, 2022, 20 : 115 - 132
  • [7] Factor analysis of high-dimensional heterogeneous data for structural characterization
    Machado, AMC
    Gee, JC
    Campos, MFM
    [J]. MEDICAL IMAGING: 2001: IMAGE PROCESSING, PTS 1-3, 2001, 4322 : 995 - 1004
  • [8] iBAG: integrative Bayesian analysis of high-dimensional multiplatform genomics data
    Wang, Wenting
    Baladandayuthapani, Veerabhadran
    Morris, Jeffrey S.
    Broom, Bradley M.
    Manyam, Ganiraju
    Do, Kim-Anh
    [J]. BIOINFORMATICS, 2013, 29 (02) : 149 - 159
  • [9] Multivariate Boosting for Integrative Analysis of High-Dimensional Cancer Genomic Data
    Xiong, Lie
    Kuan, Pei-Fen
    Tian, Jianan
    Keles, Sunduz
    Wang, Sijian
    [J]. CANCER INFORMATICS, 2014, 13 : 123 - 131
  • [10] Integrative analysis and variable selection with multiple high-dimensional data sets
    Ma, Shuangge
    Huang, Jian
    Song, Xiao
    [J]. BIOSTATISTICS, 2011, 12 (04) : 763 - 775