Unsupervised representation learning on high-dimensional clinical data improves genomic discovery and prediction

Cited by: 1
Authors
Yun, Taedong [1]
Cosentino, Justin [2]
Behsaz, Babak [1]
McCaw, Zachary R. [2, 13]
Hill, Davin [3, 4]
Luben, Robert [5, 6, 7]
Lai, Dongbing [8]
Bates, John [9]
Yang, Howard [2]
Schwantes-An, Tae-Hwi [8, 10]
Zhou, Yuchen [1]
Khawaja, Anthony P. [5, 6, 7]
Carroll, Andrew [2]
Hobbs, Brian D. [4, 11, 12]
Cho, Michael H. [4, 11, 12]
McLean, Cory Y. [1]
Hormozdiari, Farhad [1]
Affiliations
[1] Google Res, Cambridge, MA 02142 USA
[2] Google Res, Mountain View, CA USA
[3] Northeastern Univ, Dept Elect & Comp Engn, Boston, MA USA
[4] Brigham & Womens Hosp, Channing Div Network Med, Boston, MA USA
[5] Moorfields Eye Hosp, NIHR Biomed Res Ctr, London, England
[6] Univ Coll London UCL, Inst Ophthalmol, London, England
[7] Univ Cambridge, MRC Epidemiol Unit, Cambridge, England
[8] Indiana Univ Sch Med, Dept Med & Mol Genet, Indianapolis, IN USA
[9] Verily Life Sci, South San Francisco, CA USA
[10] Indiana Univ Sch Med, Dept Med, Div Cardiol, Indianapolis, IN USA
[11] Brigham & Womens Hosp, Div Pulm & Crit Care Med, Boston, MA USA
[12] Harvard Med Sch, Boston, MA USA
[13] Insitro, South San Francisco, CA USA
Funding
UK Research and Innovation (UKRI); US National Institutes of Health (NIH); UK Medical Research Council (MRC);
Keywords
OBSTRUCTIVE PULMONARY-DISEASE; WIDE ASSOCIATION; CORRELATED PHENOTYPES; RISK; COPD; PHOTOPLETHYSMOGRAPHY; INSIGHTS; POWER; SET;
DOI
10.1038/s41588-024-01831-6
Chinese Library Classification (CLC) number
Q3 [Genetics];
Discipline classification code
071007 ; 090102 ;
Abstract
Although high-dimensional clinical data (HDCD) are increasingly available in biobank-scale datasets, their use for genetic discovery remains challenging. Here we introduce an unsupervised deep learning model, Representation Learning for Genetic Discovery on Low-Dimensional Embeddings (REGLE), for discovering associations between genetic variants and HDCD. REGLE leverages variational autoencoders to compute nonlinear disentangled embeddings of HDCD, which become the inputs to genome-wide association studies (GWAS). REGLE can uncover features not captured by existing expert-defined features and enables the creation of accurate disease-specific polygenic risk scores (PRSs) in datasets with very few labeled data. We apply REGLE to perform GWAS on respiratory and circulatory HDCD: spirograms measuring lung function and photoplethysmograms measuring blood volume changes. REGLE replicates known loci while identifying others not previously detected. REGLE embeddings are predictive of overall survival, and PRSs constructed from REGLE loci improve disease prediction across multiple biobanks. Overall, REGLE embeddings contain clinically relevant information beyond that captured by existing expert-defined features, leading to improved genetic discovery and disease prediction.

Representation Learning for Genetic Discovery on Low-Dimensional Embeddings (REGLE) uses machine learning to generate low-dimensional representations of healthcare data. Applied to lung spirograms and blood volume photoplethysmograms, REGLE factors capture additional information beyond expert-defined features, suggesting the utility of this approach.
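
The abstract states that the learned embeddings "become the inputs to genome-wide association studies". As a rough illustration of that second step only, the sketch below treats each embedding coordinate as a quantitative phenotype and regresses it on genotype dosage, one variant at a time. It assumes the embeddings were already produced by some trained encoder; the function names, variant IDs, and toy numbers are hypothetical, and the plain least-squares test without covariates is a simplification, not the authors' pipeline.

```python
import math


def association_test(dosages, phenotype):
    """Simple linear regression of a phenotype on genotype dosage (0/1/2).

    Returns (slope, standard_error, t_statistic). No covariates are
    included; that is an assumption made to keep the sketch short.
    """
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_y = sum(phenotype) / n
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(dosages, phenotype))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_g
    rss = sum((y - intercept - slope * g) ** 2
              for g, y in zip(dosages, phenotype))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se, slope / se


def gwas_on_embeddings(genotypes, embeddings):
    """Run one test per (embedding coordinate, variant) pair.

    genotypes:  dict mapping a variant ID to per-individual dosages
    embeddings: list of per-individual embedding vectors
    """
    results = {}
    for dim in range(len(embeddings[0])):
        coordinate = [vector[dim] for vector in embeddings]
        for variant_id, dosages in genotypes.items():
            results[(dim, variant_id)] = association_test(dosages, coordinate)
    return results


# Toy data for 8 individuals (entirely made up for illustration).
genotypes = {
    "var_a": [0, 0, 1, 1, 2, 2, 0, 1],
    "var_b": [1, 0, 0, 2, 0, 1, 2, 1],
}
# Coordinate 0 is built as 0.5 * var_a plus small noise; coordinate 1 is
# arbitrary and not constructed from either variant.
embeddings = [
    [0.10, 0.3], [-0.10, -0.2], [0.55, 0.1], [0.45, 0.0],
    [1.10, -0.1], [0.90, 0.2], [0.00, -0.3], [0.50, 0.0],
]

results = gwas_on_embeddings(genotypes, embeddings)
for key in sorted(results):
    slope, se, t = results[key]
    print(key, round(slope, 3), round(se, 3), round(t, 2))
```

In this toy setup, coordinate 0 shows a strong association with `var_a` and a weak one with `var_b`, which is the kind of per-coordinate signal a GWAS on embeddings would look for. A real analysis would add covariates such as age, sex, and ancestry components, and would apply a genome-wide significance threshold.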
Pages: 1604 / 1613
Number of pages: 27
Related papers
50 in total
  • [21] Reconstruction and Decomposition of High-Dimensional Landscapes via Unsupervised Learning
    Lei, Jing
    Akhter, Nasrin
    Qiao, Wanli
    Shehu, Amarda
    [J]. KDD '20: PROCEEDINGS OF THE 26TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2020, : 2505 - 2513
  • [22] Scalable and Interpretable Data Representation for High-Dimensional, Complex Data
    Kim, Been
    Patel, Kayur
    Rostamizadeh, Afshin
    Shah, Julie
    [J]. PROCEEDINGS OF THE TWENTY-NINTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2015, : 1763 - 1769
  • [23] Hybrid fast unsupervised feature selection for high-dimensional data
    Manbari, Zhaleh
    AkhlaghianTab, Fardin
    Salavati, Chiman
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2019, 124 : 97 - 118
  • [24] Proposing a Dimensionality Reduction Technique With an Inequality for Unsupervised Learning from High-Dimensional Big Data
    Ismkhan, Hassan
    Izadi, Mohammad
    [J]. IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS, 2023, 53 (06): : 3880 - 3889
  • [25] Prediction of vancomycin dose on high-dimensional data using machine learning techniques
    Huang, Xiaohui
    Yu, Ze
    Wei, Xin
    Shi, Junfeng
    Wang, Yu
    Wang, Zeyuan
    Chen, Jihui
    Bu, Shuhong
    Li, Lixia
    Gao, Fei
    Zhang, Jian
    Xu, Ajing
    [J]. EXPERT REVIEW OF CLINICAL PHARMACOLOGY, 2021, 14 (06) : 761 - 771
  • [26] Representation and classification of high-dimensional biomedical spectral data
    Pedrycz, W.
    Lee, D. J.
    Pizzi, N. J.
    [J]. PATTERN ANALYSIS AND APPLICATIONS, 2010, 13 (04) : 423 - 436
  • [27] Nonlinear Causal Discovery for High-Dimensional Deterministic Data
    Zeng, Yan
    Hao, Zhifeng
    Cai, Ruichu
    Xie, Feng
    Huang, Libo
    Shimizu, Shohei
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 34 (05) : 2234 - 2245
  • [28] The Validation and Assessment of Machine Learning: A Game of Prediction from High-Dimensional Data
    Pers, Tune H.
    Albrechtsen, Anders
    Holst, Claus
    Sorensen, Thorkild I. A.
    Gerds, Thomas A.
    [J]. PLOS ONE, 2009, 4 (08):
  • [29] Representation and classification of high-dimensional biomedical spectral data
    W. Pedrycz
    D. J. Lee
    N. J. Pizzi
    [J]. Pattern Analysis and Applications, 2010, 13 : 423 - 436
  • [30] Improving Genomic Prediction Using High-Dimensional Secondary Phenotypes
    Arouisse, Bader
    Theeuwen, Tom P. J. M.
    van Eeuwijk, Fred A.
    Kruijer, Willem
    [J]. FRONTIERS IN GENETICS, 2021, 12