Comparing a few SNP calling algorithms using low-coverage sequencing data

Cited by: 83
Authors
Yu, Xiaoqing [1]
Sun, Shuying [1, 2]
Affiliations
[1] Case Western Reserve Univ, Dept Epidemiol & Biostat, Cleveland, OH 44106 USA
[2] Texas State Univ, Dept Math, San Marcos, TX 78666 USA
Source
BMC BIOINFORMATICS | 2013 / Vol. 14
Keywords
Next generation sequencing; SNP calling; Low-coverage; Single-sample; SOAPsnp; Atlas-SNP2; SAMtools; GATK; ASSOCIATION; DISCOVERY; VARIANTS; SUSCEPTIBILITY; GENE; POLYMORPHISM; FRAMEWORK; DISEASE; LOCI;
DOI
10.1186/1471-2105-14-274
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: Many Single Nucleotide Polymorphism (SNP) calling programs have been developed to identify Single Nucleotide Variations (SNVs) in next-generation sequencing (NGS) data. However, low sequencing coverage presents challenges to accurate SNV identification, especially in single-sample data. Moreover, commonly used SNP calling programs usually include several metrics in their output files for each potential SNP. These metrics are highly correlated in complex patterns, making it extremely difficult to select SNPs for further experimental validation.
Results: To explore solutions to the above challenges, we compare the performance of four SNP calling algorithms, SOAPsnp, Atlas-SNP2, SAMtools, and GATK, on a low-coverage single-sample sequencing dataset. Without any post-output filtering, SOAPsnp calls more SNVs than the other programs since it has fewer internal filtering criteria. Atlas-SNP2 has stringent internal filtering criteria; thus it reports the fewest SNVs. The numbers of SNVs called by GATK and SAMtools fall between those of SOAPsnp and Atlas-SNP2. Moreover, we explore the values of key metrics related to SNV quality in each algorithm and use them as post-output filtering criteria to filter out low-quality SNVs. Under different coverage cutoff values, we compare the four algorithms and calculate the empirical positive calling rate and sensitivity. Our results show that: 1) the overall agreement of the four calling algorithms is low, especially in non-dbSNPs; 2) the agreement of the four algorithms is similar when using different coverage cutoffs, except that the non-dbSNPs agreement level tends to increase slightly with increasing coverage; 3) SOAPsnp, SAMtools, and GATK have a higher empirical calling rate for dbSNPs than for non-dbSNPs; and 4) overall, GATK and Atlas-SNP2 have a relatively higher positive calling rate and sensitivity, but GATK calls more SNVs.
Conclusions: Our results show that the agreement between different calling algorithms is relatively low. Thus, more caution should be used in choosing algorithms, setting filtering parameters, and designing validation studies. For reliable SNV calling results, we recommend that users employ more than one algorithm and use metrics related to calling quality and coverage as filtering criteria.
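As a rough illustration of the kind of comparison the abstract describes, the sketch below applies a coverage cutoff to each caller's output and then measures how much the four call sets agree. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names, the `Site` type, the toy calls, and the definition of agreement as "sites called by every caller divided by sites called by at least one caller" are all hypothetical choices made for this example.

```python
from typing import Dict, Set, Tuple

# Hypothetical representation: a site is (chromosome, 1-based position).
Site = Tuple[str, int]


def filter_by_coverage(calls: Dict[Site, int], min_depth: int) -> Set[Site]:
    """Keep only the sites whose read depth is at least min_depth."""
    return {site for site, depth in calls.items() if depth >= min_depth}


def agreement(call_sets: Dict[str, Set[Site]]) -> float:
    """Fraction of sites called by every caller among sites called by any caller.

    This definition of agreement is an assumption made for illustration.
    """
    sets = list(call_sets.values())
    union = set().union(*sets)
    if not union:
        return 0.0
    shared = set.intersection(*sets)
    return len(shared) / len(union)


def agreement_at_cutoff(calls: Dict[str, Dict[Site, int]], min_depth: int) -> float:
    """Apply the same coverage cutoff to every caller, then compute agreement."""
    filtered = {name: filter_by_coverage(c, min_depth) for name, c in calls.items()}
    return agreement(filtered)


# Toy, made-up calls (site -> read depth); not data from the paper.
toy_calls: Dict[str, Dict[Site, int]] = {
    "SOAPsnp": {("chr1", 100): 3, ("chr1", 200): 8, ("chr1", 300): 2, ("chr1", 400): 5},
    "Atlas-SNP2": {("chr1", 200): 8},
    "SAMtools": {("chr1", 100): 3, ("chr1", 200): 8},
    "GATK": {("chr1", 100): 3, ("chr1", 200): 8, ("chr1", 400): 5},
}

for cutoff in (1, 4, 9):
    print(cutoff, agreement_at_cutoff(toy_calls, cutoff))
```

In practice the per-caller dictionaries would be parsed from each program's own output format, and the same functions could be run separately on dbSNP and non-dbSNP sites to compare agreement between the two groups.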
Pages: 15
Related papers
50 in total
  • [41] Next-Generation Sequencing Data Analysis on Pool-Seq and Low-Coverage Retinoblastoma Data
    Gülistan Özdemir Özdoğan
    Hilal Kaya
    [J]. Interdisciplinary Sciences: Computational Life Sciences, 2020, 12 : 302 - 310
  • [42] Phylogenomics from low-coverage whole-genome sequencing
    Zhang, Feng
    Ding, Yinhuan
    Zhu, Chao-Dong
    Zhou, Xin
    Orr, Michael C.
    Scheu, Stefan
    Luan, Yun-Xia
    [J]. METHODS IN ECOLOGY AND EVOLUTION, 2019, 10 (04): : 507 - 517
  • [43] Powerful eQTL mapping through low-coverage RNA sequencing
    Schwarz, Tommer
    Boltz, Toni
    Hou, Kangcheng
    Bot, Merel
    Duan, Chenda
    Loohuis, Loes Olde
    Boks, Marco P.
    Kahn, Rene S.
    Ophoff, Roel A.
    Pasaniuc, Bogdan
    [J]. HUMAN GENETICS AND GENOMICS ADVANCES, 2022, 3 (03):
  • [44] Low-Coverage Sequencing Imputation from millions of reference samples
    Rubinacci, Simone
    Delaneau, Olivier
    [J]. HUMAN HEREDITY, 2022, VOL. (SUPPL 1) : 4 - 5
  • [45] Rare Variant Association Testing Under Low-Coverage Sequencing
    Navon, Oron
    Sul, Jae Hoon
    Han, Buhm
    Conde, Lucia
    Bracci, Paige M.
    Riby, Jacques
    Skibola, Christine F.
    Eskin, Eleazar
    Halperin, Eran
    [J]. GENETICS, 2013, 194 (03): : 769 - +
  • [46] Detecting inherited and novel structural variants in low-coverage parent-child sequencing data
    Spence, Melissa
    Banuelos, Mario
    Marcia, Roummel F.
    Sindi, Suzanne
    [J]. METHODS, 2020, 173 : 61 - 68
  • [47] AKSmooth: Enhancing low-coverage bisulfite sequencing data via kernel-based smoothing
    Chen, Junfang
    Lutsik, Pavlo
    Akulenko, Ruslan
    Walter, Joern
    Helms, Volkhard
    [J]. JOURNAL OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY, 2014, 12 (06)
  • [48] Detecting Pathogenic Structural Variants with Low-Coverage PacBio Sequencing
    Hickey, L.
    Wenger, A. M.
    Baybayan, P.
    Peluso, P.
    Korlach, J.
    [J]. EUROPEAN JOURNAL OF HUMAN GENETICS, 2018, 26 : 729 - 729
  • [49] Imputation of low-coverage sequencing data from 150,119 UK Biobank genomes
    Rubinacci, Simone
    Hofmeister, Robin J.
    da Mota, Barbara Sousa
    Delaneau, Olivier
    [J]. EUROPEAN JOURNAL OF HUMAN GENETICS, 2024, 32 : 50 - 50
  • [50] A novel nonlinear dimension reduction approach to infer population structure for low-coverage sequencing data
    Miao Zhang
    Yiwen Liu
    Hua Zhou
    Joseph Watkins
    Jin Zhou
    [J]. BMC Bioinformatics, 22