Comparative analysis of metagenomic classifiers for long-read sequencing datasets

Cited by: 8
Authors
Maric, Josip [1]
Krizanovic, Kresimir [1]
Riondet, Sylvain [2, 3]
Nagarajan, Niranjan [2, 3]
Sikic, Mile [1, 2]
Affiliations
[1] Univ Zagreb, Fac Elect Engn & Comp, Unska 3, Zagreb 10000, Croatia
[2] ASTAR, Genome Inst Singapore GIS, 60 Biopolis St, Singapore 138672, Singapore
[3] Natl Univ Singapore, Yong Loo Lin Sch Med, Singapore 117596, Singapore
Keywords
Metagenomics; Long sequenced reads; Classification; Benchmark; Abundance
DOI
10.1186/s12859-024-05634-8
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Long reads have gained popularity in the analysis of metagenomics data. Therefore, we comprehensively assessed metagenomics classification tools on the species taxonomic level. We analysed kmer-based tools, mapping-based tools and two general-purpose long-read mappers. We evaluated more than 20 pipelines which use either nucleotide or protein databases and selected 13 for an extensive benchmark. We prepared seven synthetic datasets to test various scenarios, including the presence of a host, unknown species and related species. Moreover, we used available sequencing data from three well-defined mock communities, including a dataset with abundance varying from 0.0001% to 20%, and six real gut microbiomes.
Results: General-purpose mappers Minimap2 and Ram achieved similar or better accuracy on most testing metrics than the best-performing classification tools. They were up to ten times slower than the fastest kmer-based tools requiring up to four times less RAM. All tested tools except CLARK-S were prone to report organisms not present in the datasets, and they underperformed when a high proportion of the host's genetic material was present. Tools which use a protein database performed worse than those based on a nucleotide database. Longer read lengths made classification easier, but because read length distributions differ among species, using only the longest reads reduced accuracy. The comparison of real gut microbiome datasets shows similar abundance profiles for tools of the same type, but discordance in the number of reported organisms and in abundances between types. Most assessments showed the influence of database completeness on the reports.
Conclusion: The findings indicate that kmer-based tools are well suited for rapid analysis of long-read data. However, when heightened accuracy is essential, mappers demonstrate slightly superior performance, albeit at a considerably slower pace. Nevertheless, a combination of diverse categories of tools and databases will likely be necessary to analyse complex samples. Discrepancies observed among tools when applied to real gut datasets, as well as reduced performance in cases where unknown species or a significant proportion of the host genome is present in the sample, highlight the need for continuous improvement of existing tools. Additionally, regular updates and curation of databases are important to ensure their effectiveness.
Pages: 26