SNP Selection and Classification of Genome-Wide SNP Data Using Stratified Sampling Random Forests

Cited by: 44
Authors
Wu, Qingyao [1]
Ye, Yunming [1]
Liu, Yang [2]
Ng, Michael K. [2]
Affiliations
[1] Harbin Inst Technol, Shenzhen Grad Sch, Dept Comp Sci, Harbin, Peoples R China
[2] Hong Kong Baptist Univ, Dept Math, Hong Kong, Hong Kong, Peoples R China
Keywords
Genome-wide association study; SNP; random forest; stratified sampling; variable importance; missing data; association
DOI
10.1109/TNB.2012.2214232
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
In high-dimensional genome-wide association (GWA) case-control data for complex diseases, a large proportion of single-nucleotide polymorphisms (SNPs) are usually irrelevant to the disease. A random forest that uses simple random sampling with the default parameter to choose feature subspaces will select too many subspaces that contain no informative SNPs. An exhaustive search for an optimal parameter value is therefore often required to include useful, relevant SNPs and to exclude the vast number of non-informative ones. However, such a search is too time-consuming to be practical for high-dimensional GWA data. The main aim of this paper is to propose a stratified sampling method for feature subspace selection when generating the decision trees of a random forest for high-dimensional GWA data. Our idea is to design an equal-width discretization scheme for informativeness that divides SNPs into multiple groups. In feature subspace selection, we randomly select the same number of SNPs from each group and combine them to form a subspace from which a decision tree is generated. The advantage of this stratified sampling procedure is that it ensures each subspace contains enough useful SNPs while avoiding the very high computational cost of an exhaustive search for an optimal parameter value and maintaining the randomness of a random forest. We employ two genome-wide SNP data sets (Parkinson case-control data comprising 408 803 SNPs and Alzheimer case-control data comprising 380 157 SNPs) to demonstrate that the proposed stratified sampling method is effective and that it generates better random forests, with higher accuracy and a lower error bound, than Breiman's random forest generation method. For the Parkinson data, we also report some interesting genes identified by the method that may be associated with neurological disorders and merit further biological investigation.
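The subspace selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `stratified_subspace`, its parameters, and the use of NumPy are assumptions made for illustration, and `scores` stands in for whatever per-SNP informativeness measure is used, which the abstract does not specify. The sketch bins the scores into equal-width groups and draws the same number of features from each non-empty group.

```python
import numpy as np

def stratified_subspace(scores, n_groups, per_group, rng):
    """Hypothetical helper: sample feature indices evenly across score bins."""
    scores = np.asarray(scores, dtype=float)
    # Equal-width bin edges over the observed range of informativeness scores.
    edges = np.linspace(scores.min(), scores.max(), n_groups + 1)
    # Using only interior edges yields bin ids in 0..n_groups-1,
    # with the maximum score assigned to the last bin.
    bins = np.digitize(scores, edges[1:-1])
    chosen = []
    for g in range(n_groups):
        members = np.flatnonzero(bins == g)
        if members.size == 0:
            continue  # equal-width bins may be empty
        # Draw without replacement; small bins contribute all their members.
        k = min(per_group, members.size)
        chosen.append(rng.choice(members, size=k, replace=False))
    return np.sort(np.concatenate(chosen))

# Usage on synthetic scores (assumed data, not from the paper).
rng = np.random.default_rng(0)
demo_scores = rng.random(1000)
subspace = stratified_subspace(demo_scores, n_groups=5, per_group=10, rng=rng)
print(len(subspace))
```

Each tree in the forest would call such a function once to obtain its own subspace, so every tree sees features from the high-score groups while the draw within each group stays random.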
Pages: 216-227
Number of pages: 12