Comparing a few SNP calling algorithms using low-coverage sequencing data

Cited by: 83
Authors
Yu, Xiaoqing [1]
Sun, Shuying [1, 2]
Affiliations
[1] Case Western Reserve Univ, Dept Epidemiol & Biostat, Cleveland, OH 44106 USA
[2] Texas State Univ, Dept Math, San Marcos, TX 78666 USA
Source
BMC Bioinformatics | 2013 / Vol. 14
Keywords
Next generation sequencing; SNP calling; Low-coverage; Single-sample; SOAPsnp; Atlas-SNP2; SAMtools; GATK; ASSOCIATION; DISCOVERY; VARIANTS; SUSCEPTIBILITY; GENE; POLYMORPHISM; FRAMEWORK; DISEASE; LOCI;
DOI
10.1186/1471-2105-14-274
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Many Single Nucleotide Polymorphism (SNP) calling programs have been developed to identify Single Nucleotide Variations (SNVs) in next-generation sequencing (NGS) data. However, low sequencing coverage presents challenges to accurate SNV identification, especially in single-sample data. Moreover, commonly used SNP calling programs usually include several metrics in their output files for each potential SNP. These metrics are highly correlated in complex patterns, making it extremely difficult to select SNPs for further experimental validation.
Results: To explore solutions to the above challenges, we compare the performance of four SNP calling algorithms (SOAPsnp, Atlas-SNP2, SAMtools, and GATK) on a low-coverage single-sample sequencing dataset. Without any post-output filtering, SOAPsnp calls more SNVs than the other programs because it has fewer internal filtering criteria. Atlas-SNP2 has stringent internal filtering criteria and therefore reports the fewest SNVs. The numbers of SNVs called by GATK and SAMtools fall between those of SOAPsnp and Atlas-SNP2. Moreover, we explore the values of key metrics related to SNV quality in each algorithm and use them as post-output filtering criteria to remove low-quality SNVs. Under different coverage cutoff values, we compare the four algorithms and calculate the empirical positive calling rate and sensitivity. Our results show that: 1) the overall agreement of the four calling algorithms is low, especially in non-dbSNPs; 2) the agreement of the four algorithms is similar when using different coverage cutoffs, except that the non-dbSNP agreement level tends to increase slightly with increasing coverage; 3) SOAPsnp, SAMtools, and GATK have a higher empirical calling rate for dbSNPs compared to non-dbSNPs; and 4) overall, GATK and Atlas-SNP2 have a relatively higher positive calling rate and sensitivity, but GATK calls more SNVs.
Conclusions: Our results show that the agreement between different calling algorithms is relatively low. Thus, more caution should be used in choosing algorithms, setting filtering parameters, and designing validation studies. For reliable SNV calling results, we recommend that users employ more than one algorithm and use metrics related to calling quality and coverage as filtering criteria.
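The comparison described in the abstract (apply a coverage cutoff, then measure how far the callers agree, separately for dbSNP and non-dbSNP sites) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy call sets, the site coordinates, the coverage values, the function names, and the choice of "sites called by all callers divided by sites called by any caller" as the agreement measure. The abstract does not define its agreement statistic, its positive calling rate, or its sensitivity, so this is not the authors' method.

```python
# Hypothetical toy data: caller name -> {(chromosome, position): coverage depth}.
# These values are invented and do not come from the paper.
calls = {
    "SOAPsnp":    {("chr1", 100): 3, ("chr1", 200): 8, ("chr1", 300): 2, ("chr1", 400): 5},
    "Atlas-SNP2": {("chr1", 200): 8},
    "SAMtools":   {("chr1", 100): 3, ("chr1", 200): 8, ("chr1", 400): 5},
    "GATK":       {("chr1", 200): 8, ("chr1", 300): 2, ("chr1", 400): 5},
}
# Hypothetical set of sites that are present in dbSNP.
dbsnp = {("chr1", 200), ("chr1", 400)}


def filter_by_coverage(sites, min_cov):
    """Keep only the sites whose coverage is at least min_cov."""
    return {site for site, cov in sites.items() if cov >= min_cov}


def agreement(call_sets):
    """Fraction of sites called by every caller among sites called by any caller.

    This ratio is an assumed agreement measure, chosen for simplicity.
    Returns 0.0 when no caller reports any site.
    """
    sets = list(call_sets.values())
    union = set().union(*sets)
    if not union:
        return 0.0
    shared = set.intersection(*sets)
    return len(shared) / len(union)


def agreement_by_dbsnp(calls, dbsnp, min_cov):
    """Agreement among callers after a coverage cutoff, split by dbSNP status."""
    filtered = {name: filter_by_coverage(sites, min_cov) for name, sites in calls.items()}
    known = {name: sites & dbsnp for name, sites in filtered.items()}
    novel = {name: sites - dbsnp for name, sites in filtered.items()}
    return {"dbSNP": agreement(known), "non-dbSNP": agreement(novel)}


if __name__ == "__main__":
    for cutoff in (1, 3, 6):
        print(cutoff, agreement_by_dbsnp(calls, dbsnp, cutoff))
```

In practice the call sets would be parsed from each program's output files, and each caller reports its own depth and quality metrics, so the filtering step would need a per-caller choice of metric.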
Pages: 15