Synthesize high-dimensional longitudinal electronic health records via hierarchical autoregressive language model

Cited by: 0
Authors
Brandon Theodorou
Cao Xiao
Jimeng Sun
Affiliations
[1] University of Illinois at Urbana-Champaign
[2] Medisyn Inc.
DOI: not available
Abstract
Synthetic electronic health records (EHRs) that are both realistic and privacy-preserving offer an alternative to real EHRs for machine learning (ML) and statistical analysis. However, generating high-fidelity EHR data in its original, high-dimensional form poses challenges for existing methods. We propose the Hierarchical Autoregressive Language mOdel (HALO) for generating longitudinal, high-dimensional EHRs that preserve the statistical properties of real EHRs and can be used to train accurate ML models without privacy concerns. HALO generates a probability density function over medical codes, clinical visits, and patient records, allowing it to generate realistic EHR data without requiring variable selection or aggregation. Extensive experiments demonstrated that HALO can generate high-fidelity data with high-dimensional disease code probabilities closely mirroring those of real EHR data (R² correlation above 0.9). HALO also enhances the accuracy of predictive modeling and enables downstream ML models to attain accuracy similar to that of models trained on genuine data.
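The fidelity check mentioned in the abstract (code probabilities in synthetic data compared against real data) can be illustrated with a small sketch. This is not the authors' evaluation code. It assumes a record is a list of visits, each visit is a list of code strings, a code's probability is the fraction of patients with that code at least once, and "R²" means the coefficient of determination with the real probabilities as targets. The function names and the toy records are hypothetical.

```python
from collections import Counter


def code_probabilities(records, vocabulary):
    """Fraction of patients whose record contains each code at least once."""
    counts = Counter()
    for visits in records:
        seen = set()
        for visit in visits:
            seen.update(visit)
        counts.update(seen)  # each patient contributes at most 1 per code
    n = len(records)
    return [counts[code] / n for code in vocabulary]


def r_squared(real, synthetic):
    """Coefficient of determination, treating `real` as the target values."""
    mean_real = sum(real) / len(real)
    ss_res = sum((r - s) ** 2 for r, s in zip(real, synthetic))
    ss_tot = sum((r - mean_real) ** 2 for r in real)
    return 1.0 - ss_res / ss_tot


# Toy data (made up for illustration): 4 patients, 3 codes.
vocabulary = ["I10", "E11", "J45"]
real_records = [
    [["I10", "E11"], ["I10"]],
    [["I10"], ["J45"]],
    [["E11"]],
    [["I10", "J45"]],
]
synthetic_records = [
    [["I10"], ["E11"]],
    [["I10", "J45"]],
    [["E11"]],
    [["J45"]],
]

real_p = code_probabilities(real_records, vocabulary)
syn_p = code_probabilities(synthetic_records, vocabulary)
score = r_squared(real_p, syn_p)
print(real_p, syn_p, score)
```

On real data the vocabulary would contain thousands of codes, and the same comparison could be repeated at the visit level or for co-occurring code pairs. A score near 1 indicates that synthetic code frequencies track the real ones closely.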
Related papers
50 in total
  • [1] Author Correction: Synthesize high-dimensional longitudinal electronic health records via hierarchical autoregressive language model
    Brandon Theodorou
    Cao Xiao
    Jimeng Sun
    [J]. Nature Communications, 14
  • [2] Synthesize high-dimensional longitudinal electronic health records via hierarchical autoregressive language model (vol 14, 5305, 2023)
    Theodorou, Brandon
    Xiao, Cao
    Sun, Jimeng
    [J]. NATURE COMMUNICATIONS, 2023, 14 (01)
  • [3] A Counterfactual Fair Model for Longitudinal Electronic Health Records via Deconfounder
    Liu, Zheng
    Li, Xiaohan
    Yu, Philip S.
    [J]. 23RD IEEE INTERNATIONAL CONFERENCE ON DATA MINING, ICDM 2023, 2023, : 1175 - 1180
  • [4] A large language model for electronic health records
    Xi Yang
    Aokun Chen
    Nima PourNejatian
    Hoo Chang Shin
    Kaleb E. Smith
    Christopher Parisien
    Colin Compas
    Cheryl Martin
    Anthony B. Costa
    Mona G. Flores
    Ying Zhang
    Tanja Magoc
    Christopher A. Harle
    Gloria Lipori
    Duane A. Mitchell
    William R. Hogan
    Elizabeth A. Shenkman
    Jiang Bian
    Yonghui Wu
    [J]. npj Digital Medicine, 5
  • [5] A large language model for electronic health records
    Yang, Xi
    Chen, Aokun
    PourNejatian, Nima
    Shin, Hoo Chang
    Smith, Kaleb E.
    Parisien, Christopher
    Compas, Colin
    Martin, Cheryl
    Costa, Anthony B.
    Flores, Mona G.
    Zhang, Ying
    Magoc, Tanja
    Harle, Christopher A.
    Lipori, Gloria
    Mitchell, Duane A.
    Hogan, William R.
    Shenkman, Elizabeth A.
    Bian, Jiang
    Wu, Yonghui
    [J]. NPJ DIGITAL MEDICINE, 2022, 5 (01)
  • [6] Model-based clustering of high-dimensional longitudinal data via regularization
    Yang, Luoying
    Wu, Tong Tong
    [J]. BIOMETRICS, 2023, 79 (02) : 761 - 774
  • [7] Multivariate autoregressive model estimation for high-dimensional intracranial electrophysiological data
    Endemann, Christopher M.
    Krause, Bryan M.
    Nourski, Kirill V.
    Banks, Matthew I.
    Van Veen, Barry
    [J]. NEUROIMAGE, 2022, 254
  • [8] Comparing high-dimensional confounder control methods for rapid cohort studies from electronic health records
    Low, Yen Sia
    Gallego, Blanca
    Shah, Nigam Haresh
    [J]. JOURNAL OF COMPARATIVE EFFECTIVENESS RESEARCH, 2016, 5 (02) : 179 - 192
  • [9] ESTIMATION OF HIGH-DIMENSIONAL CONNECTIVITY IN FMRI DATA VIA SUBSPACE AUTOREGRESSIVE MODELS
    Ting, Chee-Ming
    Seghouane, Abd-Krim
    Salleh, Sh-Hussain
    [J]. 2016 IEEE STATISTICAL SIGNAL PROCESSING WORKSHOP (SSP), 2016
  • [10] High-dimensional generalized semiparametric model for longitudinal data
    Taavoni, M.
    Arashi, M.
    [J]. STATISTICS, 2021, 55 (04) : 831 - 850