SNP Selection and Classification of Genome-Wide SNP Data Using Stratified Sampling Random Forests

Citations: 44
Authors
Wu, Qingyao [1]
Ye, Yunming [1]
Liu, Yang [2]
Ng, Michael K. [2]
Affiliations
[1] Harbin Inst Technol, Shenzhen Grad Sch, Dept Comp Sci, Harbin, Peoples R China
[2] Hong Kong Baptist Univ, Dept Math, Hong Kong, Hong Kong, Peoples R China
Keywords
Genome-wide association study; SNP; random forest; stratified sampling; variable importance; missing data; association
DOI
10.1109/TNB.2012.2214232
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Subject classification code
071010 ; 081704 ;
Abstract
In high-dimensional genome-wide association (GWA) case-control data for complex diseases, a large proportion of single-nucleotide polymorphisms (SNPs) are usually irrelevant to the disease. A random forest that uses simple random sampling with the default parameter to choose feature subspaces will select too many subspaces that contain no informative SNPs. An exhaustive search for an optimal value of this parameter is often required in order to include useful, relevant SNPs and discard the vast number of non-informative ones. However, such a search is too time-consuming to be practical for high-dimensional GWA data. The main aim of this paper is to propose a stratified sampling method for feature subspace selection when generating decision trees in a random forest for high-dimensional GWA data. Our idea is to design an equal-width discretization scheme for informativeness that divides SNPs into multiple groups. In feature subspace selection, we randomly select the same number of SNPs from each group and combine them to form a subspace from which a decision tree is generated. This stratified sampling procedure ensures that each subspace contains enough useful SNPs, avoids the very high computational cost of an exhaustive search for an optimal parameter value, and maintains the randomness of a random forest. We use two genome-wide SNP data sets (Parkinson case-control data comprising 408 803 SNPs and Alzheimer case-control data comprising 380 157 SNPs) to demonstrate that the proposed stratified sampling method is effective and that it generates better random forests, with higher accuracy and a lower error bound, than Breiman's random forest generation method. For the Parkinson data, we also report some interesting genes identified by the method, which may be associated with neurological disorders and merit further biological investigation.
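The subspace-selection step described in the abstract can be illustrated with a minimal sketch. It assumes that a per-SNP informativeness score has already been computed by some means (the abstract does not say how), and the function names `equal_width_groups` and `stratified_subspace`, the number of groups, and the per-group sample size are all hypothetical choices made for illustration, not the authors' implementation.

```python
import numpy as np


def equal_width_groups(scores, n_groups):
    """Label each SNP with an equal-width bin (0 .. n_groups - 1) of its score.

    `scores` is an assumed, precomputed informativeness value per SNP.
    """
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        # All scores identical: everything falls into a single group.
        return np.zeros(scores.shape[0], dtype=int)
    edges = np.linspace(lo, hi, n_groups + 1)
    # Use interior edges only, so labels stay within 0 .. n_groups - 1.
    return np.digitize(scores, edges[1:-1])


def stratified_subspace(groups, per_group, rng):
    """Draw the same number of SNP indices from every non-empty group.

    If a group has fewer than `per_group` members, all of them are taken.
    """
    chosen = []
    for g in np.unique(groups):
        members = np.flatnonzero(groups == g)
        k = min(per_group, members.size)
        chosen.append(rng.choice(members, size=k, replace=False))
    return np.sort(np.concatenate(chosen))
```

Under these assumptions, one would call `stratified_subspace` once per tree with a shared `np.random.default_rng(...)` generator and grow that tree on the returned SNP columns. Because every group contributes to every subspace, high-scoring SNPs are always represented, while the random draw within each group keeps the trees diverse.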
Pages: 216-227
Page count: 12