Unsupervised representation learning on high-dimensional clinical data improves genomic discovery and prediction

Cited by: 1
Authors
Yun, Taedong [1]
Cosentino, Justin [2]
Behsaz, Babak [1]
McCaw, Zachary R. [2, 13]
Hill, Davin [3, 4]
Luben, Robert [5, 6, 7]
Lai, Dongbing [8]
Bates, John [9]
Yang, Howard [2]
Schwantes-An, Tae-Hwi [8, 10]
Zhou, Yuchen [1]
Khawaja, Anthony P. [5, 6, 7]
Carroll, Andrew [2]
Hobbs, Brian D. [4, 11, 12]
Cho, Michael H. [4, 11, 12]
McLean, Cory Y. [1]
Hormozdiari, Farhad [1]
Affiliations
[1] Google Res, Cambridge, MA 02142 USA
[2] Google Res, Mountain View, CA USA
[3] Northeastern Univ, Dept Elect & Comp Engn, Boston, MA USA
[4] Brigham & Womens Hosp, Channing Div Network Med, Boston, MA USA
[5] Moorfields Eye Hosp, NIHR Biomed Res Ctr, London, England
[6] Univ Coll London UCL, Inst Ophthalmol, London, England
[7] Univ Cambridge, MRC Epidemiol Unit, Cambridge, England
[8] Indiana Univ Sch Med, Dept Med & Mol Genet, Indianapolis, IN USA
[9] Verily Life Sci, South San Francisco, CA USA
[10] Indiana Univ Sch Med, Dept Med, Div Cardiol, Indianapolis, IN USA
[11] Brigham & Womens Hosp, Div Pulm & Crit Care Med, Boston, MA USA
[12] Harvard Med Sch, Boston, MA USA
[13] Insitro, South San Francisco, CA USA
Funding
UK Research and Innovation (UKRI); US National Institutes of Health (NIH); UK Medical Research Council (MRC)
Keywords
OBSTRUCTIVE PULMONARY-DISEASE; WIDE ASSOCIATION; CORRELATED PHENOTYPES; RISK; COPD; PHOTOPLETHYSMOGRAPHY; INSIGHTS; POWER; SET;
DOI
10.1038/s41588-024-01831-6
Chinese Library Classification (CLC) number
Q3 [Genetics]
Discipline classification codes
071007; 090102
Abstract
Although high-dimensional clinical data (HDCD) are increasingly available in biobank-scale datasets, their use for genetic discovery remains challenging. Here we introduce an unsupervised deep learning model, Representation Learning for Genetic Discovery on Low-Dimensional Embeddings (REGLE), for discovering associations between genetic variants and HDCD. REGLE uses variational autoencoders to compute nonlinear disentangled embeddings of HDCD, which become the inputs to genome-wide association studies (GWAS). REGLE can uncover features not captured by existing expert-defined features and enables the creation of accurate disease-specific polygenic risk scores (PRSs) in datasets with very little labeled data. We apply REGLE to perform GWAS on two kinds of respiratory and circulatory HDCD: spirograms measuring lung function and photoplethysmograms measuring blood volume changes. REGLE replicates known loci while identifying others not previously detected. REGLE embeddings are predictive of overall survival, and PRSs constructed from REGLE loci improve disease prediction across multiple biobanks. Overall, REGLE embeddings contain clinically relevant information beyond that captured by existing expert-defined features, leading to improved genetic discovery and disease prediction.

REGLE uses machine learning to generate low-dimensional representations of healthcare data. Applied to lung spirograms and blood volume photoplethysmograms, REGLE factors capture additional information beyond expert-defined features, suggesting the utility of this approach.
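The abstract describes a two-stage pipeline: an encoder turns each HDCD record into a few embedding coordinates, and each coordinate is then treated as a phenotype in a GWAS. The sketch below illustrates only the second stage for a single variant, and it is not the authors' implementation. It assumes the embeddings already exist (here they are simulated), uses plain ordinary least squares with no covariates, and all names (`linear_association`, `simulate`, the effect size 0.5, the allele frequency 0.3) are hypothetical choices made for illustration.

```python
import math
import random


def linear_association(genotypes, phenotype):
    """Regress one phenotype on genotype dosage with ordinary least squares.

    Returns (slope, t_statistic). Covariate adjustment is omitted for brevity.
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotype) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotype))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_g
    # Residual sum of squares gives the standard error of the slope.
    rss = sum((y - intercept - slope * g) ** 2 for g, y in zip(genotypes, phenotype))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope / std_err


def simulate(n=2000, seed=0):
    """Simulate dosages for one variant and two stand-in embedding coordinates."""
    rng = random.Random(seed)
    # Dosage in {0, 1, 2}: two independent allele draws at an assumed frequency of 0.3.
    genotypes = [sum(rng.random() < 0.3 for _ in range(2)) for _ in range(n)]
    # Coordinate 0 depends on the variant (assumed effect 0.5); coordinate 1 is pure noise.
    coord0 = [0.5 * g + rng.gauss(0.0, 1.0) for g in genotypes]
    coord1 = [rng.gauss(0.0, 1.0) for _ in genotypes]
    return genotypes, [coord0, coord1]


genotypes, embeddings = simulate()
# One association test per embedding coordinate, as in a per-coordinate GWAS.
results = [linear_association(genotypes, coord) for coord in embeddings]
for index, (slope, t_stat) in enumerate(results):
    print(f"coordinate {index}: slope={slope:.3f}, t={t_stat:.2f}")
```

In a real analysis this test would be repeated for every variant and every coordinate, with covariates such as age, sex, and ancestry components, using dedicated GWAS software rather than a hand-written regression.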
Pages: 1604-1613
Number of pages: 27