Evaluating long-read de novo assembly tools for eukaryotic genomes: insights and considerations

Cited by: 7
Authors
Cosma, Bianca-Maria [1]
Zade, Ramin Shirali Hossein [1]
Jordan, Erin Noel [1, 2]
van Lent, Paul [1]
Peng, Chengyao [1]
Pillay, Stephanie [1]
Abeel, Thomas [1, 3]
Affiliations
[1] Delft Univ Technol, Delft Bioinformat Lab, Intelligent Syst, NL-2628 XE Delft, Netherlands
[2] TU Dortmund Univ, Tech Biochem, D-44227 Dortmund, Germany
[3] Broad Inst MIT & Harvard, Infect Dis & Microbiome Program, Cambridge, MA 02142 USA
Source
GIGASCIENCE | 2023 / Volume 12
Keywords
de novo assembly; third-generation sequencing; benchmarking; eukaryote genomes;
DOI
10.1093/gigascience/giad100
Chinese Library Classification number
Q [Biological Sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract
Background: Assembly algorithm choice should be a deliberate, well-justified decision when researchers create genome assemblies for eukaryotic organisms from third-generation sequencing technologies. While third-generation sequencing by Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) has overcome the disadvantages of short read lengths specific to next-generation sequencing (NGS), third-generation sequencers are known to produce more error-prone reads, thereby generating a new set of challenges for assembly algorithms and pipelines. However, the introduction of HiFi reads, which offer substantially reduced error rates, has provided a promising solution for more accurate assembly outcomes. Since the introduction of third-generation sequencing technologies, many tools have been developed that aim to take advantage of the longer reads, and researchers need to choose the correct assembler for their projects.
Results: We benchmarked state-of-the-art long-read de novo assemblers to help readers make a balanced choice for the assembly of eukaryotes. To this end, we used 12 real and 64 simulated datasets from different eukaryotic genomes, with different read length distributions, imitating PacBio continuous long-read (CLR), PacBio high-fidelity (HiFi), and ONT sequencing to evaluate the assemblers. We include 5 commonly used long-read assemblers in our benchmark: Canu, Flye, Miniasm, Raven, and wtdbg2 for ONT and PacBio CLR reads. For PacBio HiFi reads, we include 5 state-of-the-art HiFi assemblers: HiCanu, Flye, Hifiasm, LJA, and MBG. Evaluation categories address the following metrics: reference-based metrics, assembly statistics, misassembly count, BUSCO completeness, runtime, and RAM usage. Additionally, we investigated the effect of increased read length on the quality of the assemblies and report that read length can, but does not always, positively impact assembly quality.
Conclusions: Our benchmark concludes that there is no assembler that performs the best in all the evaluation categories. However, our results show that overall Flye is the best-performing assembler for PacBio CLR and ONT reads, both on real and simulated data. Meanwhile, the best-performing PacBio HiFi assemblers are Hifiasm and LJA. Next, the benchmarking using longer reads shows that the increased read length improves assembly quality, but the extent to which that can be achieved depends on the size and complexity of the reference genome.
Pages: 12