Identification of disease-associated loci using machine learning for genotype and network data integration

Cited by: 7
Authors
Leal, Luis G. [1]
David, Alessia [1]
Jarvelin, Marjo-Riita [2, 3, 4, 5, 6]
Sebert, Sylvain [2, 3]
Mannikko, Minna [2]
Karhunen, Ville [2, 3, 4, 5, 6]
Seaby, Eleanor [7]
Hoggart, Clive [8]
Sternberg, Michael J. E. [1]
Affiliations
[1] Imperial Coll London, Dept Life Sci, Ctr Integrat Syst Biol & Bioinformat, London SW7 2AZ, England
[2] Univ Oulu, Fac Med, Ctr Life Course Hlth Res, FI-90014 Oulu, Finland
[3] Univ Oulu, Bioctr Oulu, SF-90220 Oulu, Finland
[4] Oulu Univ Hosp, Unit Primary Hlth Care, Oulu 90220, Finland
[5] Imperial Coll London, Sch Publ Hlth, Dept Epidemiol & Biostat, MRC PHE Ctr Environm & Hlth, London W2 1PG, England
[6] Brunel Univ London, Dept Life Sci, Coll Hlth & Life Sci, Uxbridge UB8 3PH, Middx, England
[7] Broad Inst MIT & Harvard, Program Med & Populat Genet, Cambridge, MA 02142 USA
[8] Imperial Coll London, Dept Med, London W2 1PG, England
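Funding
EU Horizon 2020; UK Medical Research Council; US National Institutes of Health; Academy of Finland; Wellcome Trust (UK)
Keywords
RISK
DOI
10.1093/bioinformatics/btz310
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract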
Motivation: Integration of different omics data could markedly help to identify biological signatures, understand the missing heritability of complex diseases and ultimately achieve personalized medicine. Standard regression models used in Genome-Wide Association Studies (GWAS) identify loci with a strong effect size, whereas GWAS meta-analyses are often needed to capture weak loci contributing to the missing heritability. Development of novel machine learning algorithms for merging genotype data with other omics data is highly needed as it could enhance the prioritization of weak loci.

Results: We developed cNMTF (corrected non-negative matrix tri-factorization), an integrative algorithm based on clustering techniques of biological data. This method assesses the interrelatedness between genotypes, phenotypes, the damaging effect of the variants and gene networks in order to identify loci-trait associations. cNMTF was used to prioritize genes associated with lipid traits in two population cohorts. We replicated 129 genes reported in GWAS world-wide and provided evidence that supports 85% of our findings (226 out of 265 genes), including recent associations in literature (NLGN1), regulators of lipid metabolism (DAB1) and pleiotropic genes for lipid traits (CARM1). Moreover, cNMTF performed efficiently against strong population structures by accounting for the individuals' ancestry. As the method is flexible in the incorporation of diverse omics data sources, it can be easily adapted to the user's research needs.
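
The abstract names non-negative matrix tri-factorization as the basis of cNMTF but does not give its update rules. As a rough illustration only, the sketch below shows plain (uncorrected) tri-factorization, X ≈ F S Gᵀ, fitted with standard multiplicative updates on a synthetic matrix of 0/1/2 allele counts. It omits the ancestry correction, the gene-network terms and the variant-effect terms described above. The function name `nmtf`, the ranks `k1` and `k2`, and the toy data are all assumptions made for this example, not the authors' implementation.

```python
import numpy as np


def nmtf(X, k1, k2, n_iter=200, eps=1e-9, seed=0):
    """Plain non-negative matrix tri-factorization: X ~ F @ S @ G.T.

    Illustrative only: uses standard multiplicative updates for the
    Frobenius-norm objective, with no correction or network terms.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Random strictly positive initialisation keeps every factor non-negative.
    F = rng.random((n, k1)) + eps   # row (e.g. variant) cluster memberships
    S = rng.random((k1, k2)) + eps  # links between row and column clusters
    G = rng.random((m, k2)) + eps   # column (e.g. individual) cluster memberships
    errors = []
    for _ in range(n_iter):
        # Each update multiplies a factor by a ratio of non-negative terms.
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
        errors.append(float(np.linalg.norm(X - F @ S @ G.T)))
    return F, S, G, errors


# Hypothetical toy input: 30 variants x 20 individuals, minor-allele counts 0/1/2.
toy_rng = np.random.default_rng(42)
X = toy_rng.integers(0, 3, size=(30, 20)).astype(float)

F, S, G, errors = nmtf(X, k1=3, k2=2)

# Hard cluster label per variant: the column of F with the largest weight.
variant_clusters = F.argmax(axis=1)
print("reconstruction error, first vs last iteration:", errors[0], errors[-1])
print("variant cluster labels:", variant_clusters)
```

Reading cluster labels from the largest entry in each row of F is one common convention for this family of methods; how cNMTF itself scores and prioritizes loci is not specified in the abstract.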
Pages: 5182-5190
Number of pages: 9