Transcriptome prediction performance across machine learning models and diverse ancestries

Cited by: 17
Authors
Okoro, Paul C. [1]
Schubert, Ryan [2]
Guo, Xiuqing [3, 4]
Johnson, W. Craig [5]
Rotter, Jerome I. [3, 4]
Hoeschele, Ina [6, 7, 8]
Liu, Yongmei [9]
Im, Hae Kyung [10]
Luke, Amy [11]
Dugas, Lara R. [11, 12]
Wheeler, Heather E. [1, 13, 14]
Affiliations
[1] Loyola Univ Chicago, Program Bioinformat, Chicago, IL 60660 USA
[2] Loyola Univ Chicago, Dept Math & Stat, Chicago, IL USA
[3] Harbor UCLA Med Ctr, Inst Translat Genom & Populat Sci, Lundquist Inst, Torrance, CA 90509 USA
[4] Harbor UCLA Med Ctr, Dept Pediat, Torrance, CA 90509 USA
[5] Univ Washington, Dept Biostat, Seattle, WA 98195 USA
[6] Virginia Tech, Fralin Life Sci Inst, Blacksburg, VA USA
[7] Virginia Tech, Dept Stat, Blacksburg, VA USA
[8] Wake Forest Sch Med, Winston Salem, NC 27101 USA
[9] Duke Univ, Sch Med, Dept Med, Durham, NC 27706 USA
[10] Univ Chicago, Dept Med, Sect Genet Med, 5841 S Maryland Ave, Chicago, IL 60637 USA
[11] Loyola Univ Chicago, Parkinson Sch Hlth Sci & Publ Hlth, Dept Publ Hlth Sci, Maywood, IL USA
[12] Univ Cape Town, Fac Hlth Sci, Dept Human Biol, Cape Town, South Africa
[13] Loyola Univ Chicago, Dept Biol, Chicago, IL 60660 USA
[14] Loyola Univ Chicago, Dept Comp Sci, Chicago, IL 60660 USA
Source
Keywords
GENOME-WIDE ASSOCIATION; GENE-EXPRESSION; VARIABLE SELECTION; COMPLEX TRAITS; REGRESSION; CETP; STRATIFICATION; REGULARIZATION; INFERENCE; HDL;
DOI
10.1016/j.xhgg.2020.100019
Chinese Library Classification number
Q3 [Genetics];
Discipline classification code
071007 ; 090102 ;
Abstract
Transcriptome prediction methods such as PrediXcan and FUSION have become popular in complex trait mapping. Most transcriptome prediction models have been trained in European populations using methods that make parametric linear assumptions, such as the elastic net (EN). In an effort to further optimize imputation of gene expression across global populations, we built transcriptome prediction models using both linear and non-linear machine learning (ML) algorithms and compared their performance to that of EN. We trained models using genotype and blood monocyte transcriptome data from the Multi-Ethnic Study of Atherosclerosis (MESA), comprising individuals of African, Hispanic, and European ancestries, and tested them using genotype and whole-blood transcriptome data from the Modeling the Epidemiology Transition Study (METS), comprising individuals of African ancestries. We show that prediction performance is highest when the training and testing populations share similar ancestries, regardless of the prediction algorithm used. While EN generally outperformed random forest (RF), support vector regression (SVR), and K-nearest neighbor (KNN), RF outperformed EN for some genes, particularly between disparate ancestries, suggesting potential robustness and reduced variability of RF imputation performance across global populations. When applied to a high-density lipoprotein (HDL) phenotype, PrediXcan with RF prediction models revealed potential gene associations missed by EN models. Therefore, by integrating other ML methods into PrediXcan and diversifying training populations to include more global ancestries, we may uncover new genes associated with complex traits.
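As a rough illustration of the train-then-test design described in the abstract, the sketch below fits one of the named non-linear learners (KNN regression) to simulated genotype dosages and scores it on a held-out set. Everything in it is an assumption made for illustration: the data are simulated rather than taken from MESA or METS, the function names (`simulate`, `knn_predict`, `pearson`) are hypothetical, and the abstract does not state which performance metric the authors used, so the Pearson correlation between predicted and observed expression is only a stand-in.

```python
import math
import random


def simulate(n_samples, effects, rng, maf=0.3, noise_sd=1.0):
    """Simulate SNP dosages (0, 1, or 2) and a linear-plus-noise expression level.

    Assumption: a toy additive model, not the real MESA/METS data.
    """
    xs, ys = [], []
    for _ in range(n_samples):
        # Each dosage is the number of minor alleles across two chromosomes.
        dosages = [(rng.random() < maf) + (rng.random() < maf) for _ in effects]
        expression = sum(b * d for b, d in zip(effects, dosages))
        expression += rng.gauss(0.0, noise_sd)
        xs.append(dosages)
        ys.append(expression)
    return xs, ys


def knn_predict(train_x, train_y, query, k=5):
    """Predict expression as the mean over the k nearest training genotypes."""
    # Squared Euclidean distance between dosage vectors.
    dists = [
        (sum((a - b) ** 2 for a, b in zip(row, query)), y)
        for row, y in zip(train_x, train_y)
    ]
    dists.sort(key=lambda pair: pair[0])
    nearest = dists[:k]
    return sum(y for _, y in nearest) / len(nearest)


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0 else 0.0


rng = random.Random(0)  # fixed seed so the run is repeatable
effects = [0.8, -0.5, 0.0, 0.3, 0.0]  # hypothetical per-SNP effect sizes

train_x, train_y = simulate(200, effects, rng)  # stand-in training cohort
test_x, test_y = simulate(50, effects, rng)     # stand-in testing cohort

preds = [knn_predict(train_x, train_y, q, k=10) for q in test_x]
r = pearson(preds, test_y)
print(f"held-out Pearson r = {r:.3f}")
```

Comparing algorithms in the way the abstract describes would amount to swapping `knn_predict` for another learner while keeping the training and testing sets and the scoring step fixed.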
Pages: 14