Comparing a few SNP calling algorithms using low-coverage sequencing data

Cited by: 83

Authors:

Yu, Xiaoqing [1]

Sun, Shuying [1,2]

Affiliations:

[1] Case Western Reserve Univ, Dept Epidemiol & Biostat, Cleveland, OH 44106 USA

[2] Texas State Univ, Dept Math, San Marcos, TX 78666 USA

Source:

BMC BIOINFORMATICS | 2013 / Vol. 14

Keywords:

Next generation sequencing; SNP calling; Low-coverage; Single-sample; SOAPsnp; Atlas-SNP2; SAMtools; GATK; ASSOCIATION; DISCOVERY; VARIANTS; SUSCEPTIBILITY; GENE; POLYMORPHISM; FRAMEWORK; DISEASE; LOCI;

DOI:

10.1186/1471-2105-14-274

Chinese Library Classification:

Q5 [Biochemistry];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

Background: Many Single Nucleotide Polymorphism (SNP) calling programs have been developed to identify Single Nucleotide Variations (SNVs) in next-generation sequencing (NGS) data. However, low sequencing coverage presents challenges to accurate SNV identification, especially in single-sample data. Moreover, commonly used SNP calling programs usually include several metrics in their output files for each potential SNP. These metrics are highly correlated in complex patterns, making it extremely difficult to select SNPs for further experimental validation.
Results: To explore solutions to the above challenges, we compare the performance of four SNP calling algorithms, SOAPsnp, Atlas-SNP2, SAMtools, and GATK, in a low-coverage single-sample sequencing dataset. Without any post-output filtering, SOAPsnp calls more SNVs than the other programs since it has fewer internal filtering criteria. Atlas-SNP2 has stringent internal filtering criteria; thus it reports the fewest SNVs. The numbers of SNVs called by GATK and SAMtools fall between those of SOAPsnp and Atlas-SNP2. Moreover, we explore the values of key metrics related to SNV quality in each algorithm and use them as post-output filtering criteria to filter out low-quality SNVs. Under different coverage cutoff values, we compare the four algorithms and calculate the empirical positive calling rate and sensitivity. Our results show that: 1) the overall agreement of the four calling algorithms is low, especially in non-dbSNPs; 2) the agreement of the four algorithms is similar when using different coverage cutoffs, except that the non-dbSNPs agreement level tends to increase slightly with increasing coverage; 3) SOAPsnp, SAMtools, and GATK have a higher empirical calling rate for dbSNPs compared to non-dbSNPs; and 4) overall, GATK and Atlas-SNP2 have a relatively higher positive calling rate and sensitivity, but GATK calls more SNVs.
Conclusions: Our results show that the agreement between different calling algorithms is relatively low. Thus, more caution should be used in choosing algorithms, setting filtering parameters, and designing validation studies. For reliable SNV calling results, we recommend that users employ more than one algorithm and use metrics related to calling quality and coverage as filtering criteria.
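
The comparison described in the abstract (filter each caller's output by a coverage cutoff, then measure how much the callers agree) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' procedure: the agreement measure used here (sites called by every caller divided by sites called by any caller) is an assumed definition, the caller names `caller_a`, `caller_b`, `caller_c` are placeholders, and the toy depths are made up for demonstration.

```python
from typing import Dict, Set, Tuple

# A called site is identified by (chromosome, 1-based position).
Site = Tuple[str, int]


def filter_by_coverage(calls: Dict[Site, int], min_coverage: int) -> Set[Site]:
    """Keep only sites whose read depth is at least min_coverage."""
    return {site for site, depth in calls.items() if depth >= min_coverage}


def overall_agreement(call_sets: Dict[str, Set[Site]]) -> float:
    """Assumed agreement measure: |sites called by all| / |sites called by any|."""
    sets = list(call_sets.values())
    union = set().union(*sets)
    if not union:
        return 0.0
    shared = set.intersection(*sets)
    return len(shared) / len(union)


# Hypothetical toy input: site -> read depth, per caller (not real data).
raw_calls = {
    "caller_a": {("chr1", 100): 3, ("chr1", 200): 8, ("chr1", 300): 12},
    "caller_b": {("chr1", 200): 8, ("chr1", 300): 12},
    "caller_c": {("chr1", 200): 7, ("chr1", 300): 12, ("chr1", 400): 2},
}

if __name__ == "__main__":
    for cutoff in (1, 5, 10):
        filtered = {
            name: filter_by_coverage(calls, cutoff)
            for name, calls in raw_calls.items()
        }
        print(cutoff, overall_agreement(filtered))
```

In practice the per-site depths would be read from each program's own output, and the same function could be applied separately to dbSNP and non-dbSNP sites to mirror the split reported in the abstract.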


Pages: 15
