SNP Selection and Classification of Genome-Wide SNP Data Using Stratified Sampling Random Forests

Cited by: 44
Authors
Wu, Qingyao [1]
Ye, Yunming [1]
Liu, Yang [2]
Ng, Michael K. [2]
Affiliations
[1] Harbin Inst Technol, Shenzhen Grad Sch, Dept Comp Sci, Harbin, Peoples R China
[2] Hong Kong Baptist Univ, Dept Math, Hong Kong, Hong Kong, Peoples R China
Keywords
Genome-wide association study; SNP; random forest; stratified sampling; VARIABLE IMPORTANCE; MISSING DATA; ASSOCIATION;
DOI
10.1109/TNB.2012.2214232
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
For high-dimensional genome-wide association (GWA) case-control data of complex diseases, a large proportion of single-nucleotide polymorphisms (SNPs) are usually irrelevant to the disease. Simple random sampling in a random forest, using the default parameter to choose the feature subspace, will select too many subspaces without informative SNPs. An exhaustive search for an optimal value of this parameter is often required in order to include useful and relevant SNPs and discard the vast number of non-informative SNPs. However, such a search is too time-consuming to be practical for high-dimensional GWA data. The main aim of this paper is to propose a stratified sampling method for feature subspace selection to generate decision trees in a random forest for high-dimensional GWA data. Our idea is to design an equal-width discretization scheme for informativeness to divide SNPs into multiple groups. In feature subspace selection, we randomly select the same number of SNPs from each group and combine them to form a subspace to generate a decision tree. The advantage of this stratified sampling procedure is that it ensures each subspace contains enough useful SNPs, while avoiding the very high computational cost of an exhaustive search for an optimal parameter value and maintaining the randomness of a random forest. We employ two genome-wide SNP data sets (Parkinson case-control data comprising 408,803 SNPs and Alzheimer case-control data comprising 380,157 SNPs) to demonstrate that the proposed stratified sampling method is effective and that it generates better random forests, with higher accuracy and a lower error bound, than Breiman's random forest generation method. For the Parkinson data, we also show some interesting genes identified by the method, which may be associated with neurological disorders and merit further biological investigation.
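The two steps named in the abstract (equal-width discretization of informativeness, then drawing the same number of SNPs from each group) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the informativeness scores are taken as a given input, since the abstract does not say how they are computed, and the function names, the number of groups, and the handling of empty or small groups are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Sequence


def equal_width_groups(scores: Sequence[float], n_groups: int) -> Dict[int, List[int]]:
    """Bin feature indices into equal-width intervals of their score.

    `scores` is a hypothetical per-SNP informativeness value; how it is
    computed is outside the scope of this sketch.
    """
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / n_groups or 1.0  # fall back to 1.0 if all scores are equal
    groups: Dict[int, List[int]] = {}
    for idx, score in enumerate(scores):
        # Clamp so that the maximum score lands in the last bin.
        g = min(int((score - lo) / width), n_groups - 1)
        groups.setdefault(g, []).append(idx)
    return groups  # empty bins are simply absent (an assumption of this sketch)


def stratified_subspace(
    groups: Dict[int, List[int]], per_group: int, rng: random.Random
) -> List[int]:
    """Draw up to `per_group` feature indices from every group, without replacement."""
    subspace: List[int] = []
    for g in sorted(groups):
        members = groups[g]
        k = min(per_group, len(members))  # assumption: small groups contribute all members
        subspace.extend(rng.sample(members, k))
    return subspace


# Toy example with 8 features and made-up scores.
toy_scores = [0.01, 0.02, 0.03, 0.05, 0.40, 0.45, 0.90, 1.00]
toy_groups = equal_width_groups(toy_scores, n_groups=3)
toy_subspace = stratified_subspace(toy_groups, per_group=1, rng=random.Random(0))
print(toy_groups)
print(toy_subspace)
```

In a forest, a fresh subspace of this kind would be drawn for each tree, so the high-score groups are always represented while the choice within each group stays random.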
Pages: 216-227
Number of pages: 12
Related papers
50 in total
  • [1] Detection of SNP-SNP Interactions in Genome-wide Association Data Using Random Forests and Association Rules
    Tung Nguyen
    Ly Le
    2018 12TH INTERNATIONAL CONFERENCE ON SOFTWARE, KNOWLEDGE, INFORMATION MANAGEMENT & APPLICATIONS (SKIMA), 2018, : 32 - +
  • [2] Bag of Naive Bayes: biomarker selection and classification from genome-wide SNP data
    Sambo, Francesco
    Trifoglio, Emanuele
    Di Camillo, Barbara
    Toffolo, Gianna M.
    Cobelli, Claudio
    BMC BIOINFORMATICS, 2012, 13
  • [3] Shrunken Dissimilarity Measure for Genome-wide SNP Data Classification
    Liao, Haiyong
    Liu, Yang
    Ng, Michael K.
    OPTIMIZATION AND SYSTEMS BIOLOGY, 2009, 11 : 73 - 80
  • [4] Species Delimitation using Genome-Wide SNP Data
    Leache, Adam D.
    Fujita, Matthew K.
    Minin, Vladimir N.
    Bouckaert, Remco R.
    SYSTEMATIC BIOLOGY, 2014, 63 (04) : 534 - 542
  • [5] Bag of Naïve Bayes: biomarker selection and classification from genome-wide SNP data
    Francesco Sambo
    Emanuele Trifoglio
    Barbara Di Camillo
    Gianna M Toffolo
    Claudio Cobelli
    BMC Bioinformatics, 13
  • [6] Simultaneous analysis of genome-wide SNP data
    Hoggart, C. J.
    De Iorio, M.
    Whittaker, J. C.
    Balding, D. J.
    GENETIC EPIDEMIOLOGY, 2007, 31 (06) : 609 - 609
  • [7] Clustering by genetic ancestry using genome-wide SNP data
    Solovieff, Nadia
    Hartley, Stephen W.
    Baldwin, Clinton T.
    Perls, Thomas T.
    Steinberg, Martin H.
    Sebastiani, Paola
    BMC GENETICS, 2010, 11
  • [8] Clustering by genetic ancestry using genome-wide SNP data
    Nadia Solovieff
    Stephen W Hartley
    Clinton T Baldwin
    Thomas T Perls
    Martin H Steinberg
    Paola Sebastiani
    BMC Genetics, 11
  • [9] Signatures of selection in riverine buffalo populations revealed by genome-wide SNP data
    Saravanan, K. A.
    Rajawat, Divya
    Kumar, Harshit
    Nayak, Sonali Sonejita
    Bhushan, Bharat
    Dutt, Triveni
    Panigrahi, Manjit
    ANIMAL BIOTECHNOLOGY, 2023, 34 (08) : 3343 - 3354
  • [10] Detection of selective sweeps in cattle using genome-wide SNP data
    Holly R Ramey
    Jared E Decker
    Stephanie D McKay
    Megan M Rolf
    Robert D Schnabel
    Jeremy F Taylor
    BMC Genomics, 14