A comparative study of k-spectrum-based error correction methods for next-generation sequencing data analysis

Cited by: 15
Authors
Akogwu, Isaac [1]
Wang, Nan [1]
Zhang, Chaoyang [1]
Gong, Ping [2]
Affiliations
[1] Univ Southern Mississippi, Sch Comp, Hattiesburg, MS 39406 USA
[2] US Army Engineer Res & Dev Ctr, Environm Lab, Vicksburg, MS 39180 USA
Funding
National Science Foundation (USA);
Keywords
Next-generation sequencing (NGS); k-mer; k-spectrum; Error correction; Sequence analysis; Bloom filter; BURROWS-WHEELER TRANSFORM; PAIRED READS; ALGORITHMS;
DOI
10.1186/s40246-016-0068-0
Chinese Library Classification (CLC) number
Q3 [Genetics];
Subject classification codes
071007 ; 090102 ;
Abstract
Background: Advances in high-throughput next-generation sequencing (NGS) have opened many opportunities for new genomic research. However, the abundance of NGS data makes it difficult to distinguish true biological variants from sequencing errors during downstream analysis. Many error correction methods have been developed to correct erroneous NGS reads before further analysis, but an independent evaluation of how dataset features such as read length, genome size, and coverage depth affect their performance is lacking. This comparative study investigates the strengths, weaknesses, and limitations of some of the newest k-spectrum-based methods and provides recommendations for users selecting suitable methods for specific NGS datasets. Methods: Six k-spectrum-based methods, i.e., Reptile, Musket, Bless, Bloocoo, Lighter, and Trowel, were compared using six simulated sets of paired-end Illumina sequencing data. These NGS datasets varied in coverage depth (10x to 120x), read length (36 to 100 bp), and genome size (4.6 to 143 MB). The Error Correction Evaluation Toolkit (ECET) was employed to derive a suite of metrics (i.e., true positives, false positives, false negatives, recall, precision, gain, and F-score) for assessing the correction quality of each method. Results: Computational experiments indicate that Musket had the best overall performance across the range of conditions represented by the six datasets. The lowest accuracy of Musket (F-score = 0.81) occurred on a dataset with a medium read length (56 bp), a medium coverage (50x), and a small genome (5.4 MB). The other five methods underperformed (F-score < 0.80) and/or failed to process one or more datasets. Conclusions: This study demonstrates that factors such as coverage depth, read length, and genome size may influence the performance of individual k-spectrum-based error correction methods. Thus, care must be taken in choosing appropriate methods for error correction of specific NGS datasets. Based on our comparative study, we recommend Musket as the top choice because of its consistently superior performance across all six test datasets. Further extensive studies are warranted to assess these methods using experimental datasets generated by NGS platforms (e.g., 454, SOLiD, and Ion Torrent) under more diversified parameter settings (k-mer values and edit distances) and to compare them against other, non-k-spectrum-based classes of error correction methods.
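The abstract names recall, precision, gain, and F-score but does not spell out how they follow from the true positive (TP), false positive (FP), and false negative (FN) counts. The sketch below uses the definitions commonly applied in the read error correction literature; it is an assumption that ECET computes them in exactly this form, and the class name, function names, and example counts are hypothetical, not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectionCounts:
    """Base-level outcome counts for one error correction run (hypothetical container)."""
    tp: int  # erroneous bases that were corrected properly
    fp: int  # correct bases that were wrongly changed
    fn: int  # erroneous bases that were left uncorrected


def recall(c: CorrectionCounts) -> float:
    """Fraction of true errors that were corrected: TP / (TP + FN)."""
    total = c.tp + c.fn
    return c.tp / total if total else 0.0


def precision(c: CorrectionCounts) -> float:
    """Fraction of attempted corrections that were right: TP / (TP + FP)."""
    total = c.tp + c.fp
    return c.tp / total if total else 0.0


def gain(c: CorrectionCounts) -> float:
    """Net share of errors removed: (TP - FP) / (TP + FN); negative if more errors are added than fixed."""
    total = c.tp + c.fn
    return (c.tp - c.fp) / total if total else 0.0


def f_score(c: CorrectionCounts) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


if __name__ == "__main__":
    # Made-up counts for illustration only; not results from the paper.
    counts = CorrectionCounts(tp=80, fp=10, fn=20)
    print(f"recall={recall(counts):.3f} precision={precision(counts):.3f} "
          f"gain={gain(counts):.3f} F-score={f_score(counts):.3f}")
```

Unlike recall and precision, gain can drop below zero, which happens when a method introduces more errors than it removes.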
Pages: 11