A comparative study of k-spectrum-based error correction methods for next-generation sequencing data analysis

Cited by: 15

Authors:

Akogwu, Isaac [1]

Wang, Nan [1]

Zhang, Chaoyang [1]

Gong, Ping [2]

Affiliations:

[1] Univ Southern Mississippi, Sch Comp, Hattiesburg, MS 39406 USA

[2] US Army Engineer Res & Dev Ctr, Environm Lab, Vicksburg, MS 39180 USA

Source:

HUMAN GENOMICS | 2016 / Volume 10

Funding:

US National Science Foundation;

Keywords:

Next-generation sequencing (NGS); k-mer; k-spectrum; Error correction; Sequence analysis; Bloom filter; BURROWS-WHEELER TRANSFORM; PAIRED READS; ALGORITHMS;

DOI:

10.1186/s40246-016-0068-0

Chinese Library Classification number:

Q3 [Genetics];

Discipline classification code:

071007 ; 090102 ;

Abstract:

Background: Advances in high-throughput next-generation sequencing (NGS) have opened innumerable opportunities for new genomic research. However, the abundance of NGS data makes it harder to distinguish true biological variants from sequencing errors during downstream analysis. Many error correction methods have been developed to correct erroneous NGS reads before further analysis, but independent evaluation of how dataset features such as read length, genome size, and coverage depth affect their performance is lacking. This comparative study investigates the strengths, weaknesses, and limitations of some of the newest k-spectrum-based methods and provides recommendations to help users select suitable methods for specific NGS datasets.

Methods: Six k-spectrum-based methods, i.e., Reptile, Musket, Bless, Bloocoo, Lighter, and Trowel, were compared using six simulated sets of paired-end Illumina sequencing data. These NGS datasets varied in coverage depth (10x to 120x), read length (36 to 100 bp), and genome size (4.6 to 143 Mb). The Error Correction Evaluation Toolkit (ECET) was employed to derive a suite of metrics (i.e., true positives, false positives, false negatives, recall, precision, gain, and F-score) for assessing the correction quality of each method.

Results: The computational experiments indicate that Musket had the best overall performance across the range of dataset characteristics represented in the six datasets. Musket's lowest accuracy (F-score = 0.81) occurred on a dataset with a medium read length (56 bp), a medium coverage (50x), and a small genome (5.4 Mb). The other five methods underperformed (F-score < 0.80) and/or failed to process one or more datasets.

Conclusions: This study demonstrates that factors such as coverage depth, read length, and genome size may influence the performance of individual k-spectrum-based error correction methods. Thus, care must be taken in choosing appropriate methods for error correction of specific NGS datasets. Based on our comparative study, we recommend Musket as the top choice because of its consistently superior performance across all six test datasets. Further extensive studies are warranted to assess these methods using experimental datasets generated by NGS platforms (e.g., 454, SOLiD, and Ion Torrent) under more diversified parameter settings (k-mer values and edit distances) and to compare them against other, non-k-spectrum-based classes of error correction methods.
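
The metrics named in the Methods paragraph can be illustrated with a minimal sketch. It assumes the definitions commonly used in the read error correction literature (recall = TP / (TP + FN), precision = TP / (TP + FP), gain = (TP - FP) / (TP + FN), F-score = harmonic mean of precision and recall); the abstract does not spell these out, so whether ECET computes them exactly this way is an assumption. The function name `correction_metrics` and the counts are hypothetical and are not taken from the study.

```python
def correction_metrics(tp: int, fp: int, fn: int) -> dict:
    """Derive recall, precision, gain, and F-score from correction counts.

    tp: erroneous bases that were corrected properly (true positives)
    fp: correct bases that were wrongly altered (false positives)
    fn: erroneous bases left uncorrected (false negatives)
    The formulas are the commonly used definitions, assumed here.
    """
    erroneous = tp + fn   # all bases that actually contained an error
    changed = tp + fp     # all bases the tool modified
    recall = tp / erroneous if erroneous else 0.0
    precision = tp / changed if changed else 0.0
    # Gain goes negative when a tool introduces more errors than it removes.
    gain = (tp - fp) / erroneous if erroneous else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {"recall": recall, "precision": precision,
            "gain": gain, "f_score": f_score}


# Hypothetical counts, for illustration only.
m = correction_metrics(tp=80, fp=10, fn=20)
print({name: round(value, 3) for name, value in m.items()})
```

Under these definitions, gain penalizes false positives directly, so a tool with high recall can still have a low or negative gain if it alters many correct bases.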


Pages: 11

