Comparing a few SNP calling algorithms using low-coverage sequencing data

Cited by: 83

Authors:

Yu, Xiaoqing [1]

Sun, Shuying [1,2]

Affiliations:

[1] Case Western Reserve Univ, Dept Epidemiol & Biostat, Cleveland, OH 44106 USA

[2] Texas State Univ, Dept Math, San Marcos, TX 78666 USA

Source:

BMC BIOINFORMATICS | 2013 / Vol. 14

Keywords:

Next generation sequencing; SNP calling; Low-coverage; Single-sample; SOAPsnp; Atlas-SNP2; SAMtools; GATK; ASSOCIATION; DISCOVERY; VARIANTS; SUSCEPTIBILITY; GENE; POLYMORPHISM; FRAMEWORK; DISEASE; LOCI;

DOI:

10.1186/1471-2105-14-274

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Subject classification codes:

071010 ; 081704 ;

Abstract:

Background: Many Single Nucleotide Polymorphism (SNP) calling programs have been developed to identify Single Nucleotide Variations (SNVs) in next-generation sequencing (NGS) data. However, low sequencing coverage presents challenges to accurate SNV identification, especially in single-sample data. Moreover, commonly used SNP calling programs usually include several metrics in their output files for each potential SNP. These metrics are highly correlated in complex patterns, making it extremely difficult to select SNPs for further experimental validation.
Results: To explore solutions to the above challenges, we compare the performance of four SNP calling algorithms, SOAPsnp, Atlas-SNP2, SAMtools, and GATK, on a low-coverage single-sample sequencing dataset. Without any post-output filtering, SOAPsnp calls more SNVs than the other programs since it has fewer internal filtering criteria. Atlas-SNP2 has stringent internal filtering criteria; thus it reports the fewest SNVs. The numbers of SNVs called by GATK and SAMtools fall between those of SOAPsnp and Atlas-SNP2. Moreover, we explore the values of key metrics related to SNV quality in each algorithm and use them as post-output filtering criteria to filter out low-quality SNVs. Under different coverage cutoff values, we compare the four algorithms and calculate the empirical positive calling rate and sensitivity. Our results show that: 1) the overall agreement of the four calling algorithms is low, especially for non-dbSNPs; 2) the agreement of the four algorithms is similar across different coverage cutoffs, except that the non-dbSNP agreement level tends to increase slightly with increasing coverage; 3) SOAPsnp, SAMtools, and GATK have a higher empirical calling rate for dbSNPs than for non-dbSNPs; and 4) overall, GATK and Atlas-SNP2 have a relatively higher positive calling rate and sensitivity, but GATK calls more SNVs.
Conclusions: Our results show that the agreement between different calling algorithms is relatively low. Thus, more caution should be used in choosing algorithms, setting filtering parameters, and designing validation studies. For reliable SNV calling results, we recommend that users employ more than one algorithm and use metrics related to calling quality and coverage as filtering criteria.
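
The abstract's two practical ideas, measuring agreement between callers under a coverage cutoff and keeping SNVs supported by more than one algorithm, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the call sets are made-up toy data, the function names are hypothetical, and "agreement" is taken to mean the fraction of all called sites that every caller reports, which may differ from the definition used in the paper.

```python
from collections import Counter

# Hypothetical toy call sets (not data from the paper):
# caller name -> {(chromosome, position): coverage depth at that site}.
calls = {
    "SOAPsnp":    {("chr1", 100): 3, ("chr1", 250): 8, ("chr1", 400): 2, ("chr2", 50): 6},
    "Atlas-SNP2": {("chr1", 250): 8, ("chr2", 50): 6},
    "SAMtools":   {("chr1", 100): 3, ("chr1", 250): 8, ("chr2", 50): 6},
    "GATK":       {("chr1", 250): 8, ("chr1", 400): 2, ("chr2", 50): 6},
}


def filter_by_coverage(calls, min_cov):
    """Keep, for each caller, only the sites whose coverage is >= min_cov."""
    return {
        name: {site for site, cov in sites.items() if cov >= min_cov}
        for name, sites in calls.items()
    }


def agreement(call_sets):
    """Fraction of all called sites that are reported by every caller.

    This intersection-over-union definition is an assumption for this sketch.
    """
    sets = list(call_sets.values())
    union = set().union(*sets)
    if not union:
        return 0.0
    common = set.intersection(*sets)
    return len(common) / len(union)


def consensus(call_sets, min_callers):
    """Sites reported by at least min_callers of the callers."""
    counts = Counter(site for sites in call_sets.values() for site in sites)
    return {site for site, n in counts.items() if n >= min_callers}


if __name__ == "__main__":
    for cutoff in (1, 5):
        filtered = filter_by_coverage(calls, cutoff)
        print(cutoff, agreement(filtered), sorted(consensus(filtered, 3)))
```

Requiring support from several callers trades sensitivity for confidence; the threshold of three callers used in the demonstration is arbitrary and would need tuning against validated sites.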

Pages: 15
