Open-Source Sequence Clustering Methods Improve the State of the Art

Cited by: 120
Authors
Kopylova, Evguenia [1]
Navas-Molina, Jose A. [1, 2]
Mercier, Celine [3]
Xu, Zhenjiang Zech [1]
Mahe, Frederic [4]
He, Yan [5]
Zhou, Hong-Wei [5]
Rognes, Torbjorn [6, 7]
Caporaso, J. Gregory [8]
Knight, Rob [1, 2]
Affiliations
[1] UCSD, Sch Med, Dept Pediat, La Jolla, CA 92093 USA
[2] Univ Calif San Diego, Dept Comp Sci & Engn, La Jolla, CA 92093 USA
[3] Univ Grenoble Alpes, Lab Ecol Alpine LECA, CNRS UMR 5553, Grenoble, France
[4] Univ Kaiserslautern, Dept Ecol, Kaiserslautern, Germany
[5] Southern Med Univ, Guangdong Prov Key Lab Trop Dis Res, State Key Lab Organ Failure Res, Dept Environm Hlth, Sch Publ Hlth & Trop Med, Guangzhou, Guangdong, Peoples R China
[6] Univ Oslo, Dept Informat, Oslo, Norway
[7] Natl Hosp Norway, Oslo Univ Hosp, Dept Microbiol, Oslo, Norway
[8] No Arizona Univ, Dept Biol Sci, Box 5640, Flagstaff, AZ 86011 USA
Funding
U.S. National Institutes of Health (NIH)
Keywords
sequence clustering; operational taxonomic units; microbial community analysis; amplicon sequencing; PROTEIN; GREENGENES; DIVERSITY; RESOURCE; PROGRAM; UNIFRAC; SEARCH
DOI
10.1128/mSystems.00003-15
Chinese Library Classification (CLC) number
Q93 [Microbiology]
Subject classification codes
071005; 100705
Abstract
Sequence clustering is a common early step in amplicon-based microbial community analysis, when raw sequencing reads are clustered into operational taxonomic units (OTUs) to reduce the run time of subsequent analysis steps. Here, we evaluated the performance of recently released state-of-the-art open-source clustering software products, namely, OTUCLUST, Swarm, SUMACLUST, and SortMeRNA, against current principal options (UCLUST and USEARCH) in QIIME, hierarchical clustering methods in mothur, and USEARCH's most recent clustering algorithm, UPARSE. All the latest open-source tools showed promising results, reporting up to 60% fewer spurious OTUs than UCLUST, indicating that the underlying clustering algorithm can vastly reduce the number of these derived OTUs. Furthermore, we observed that stringent quality filtering, such as is done in UPARSE, can cause a significant underestimation of species abundance and diversity, leading to incorrect biological results. Swarm, SUMACLUST, and SortMeRNA have been included in the QIIME 1.9.0 release.
IMPORTANCE Massive collections of next-generation sequencing data call for fast, accurate, and easily accessible bioinformatics algorithms to perform sequence clustering. A comprehensive benchmark is presented, including open-source tools and the popular USEARCH suite. Simulated, mock, and environmental communities were used to analyze sensitivity, selectivity, species diversity (alpha and beta), and taxonomic composition. The results demonstrate that recent clustering algorithms can significantly improve accuracy and preserve estimated diversity without the application of aggressive filtering. Moreover, these tools are all open source, apply multiple levels of multithreading, and scale to the demands of modern next-generation sequencing data, which is essential for the analysis of massive multidisciplinary studies such as the Earth Microbiome Project (EMP).
Pages: 16
Related papers
50 in total
  • [1] A comparative study of two open-source state-of-the-art geometric VOF methods
    Esteban, Adolfo
    Lopez, Joaquin
    Gomez, Pablo
    Zanzi, Claudio
    Roenby, Johan
    Hernandez, Julio
    [J]. COMPUTERS & FLUIDS, 2023, 250
  • [2] An Open-source State-of-the-art Toolbox for Broadcast News Diarization
    Rouvier, Mickael
    Dupuy, Gregor
    Gay, Paul
    Khoury, Elie
    Merlin, Teva
    Meignier, Sylvain
    [J]. 14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 2013, : 1476 - 1480
  • [3] Replication of the bSTAR sequence and open-source implementation
    Lee, Nam G.
    Bauman, Grzegorz
    Bieri, Oliver
    Nayak, Krishna S.
    [J]. MAGNETIC RESONANCE IN MEDICINE, 2023, : 1464 - 1477
  • [4] Open-source methods: Peering through the clutter
    Bollinger, T
    Nelson, R
    Self, KM
    Turnbull, SJ
    [J]. IEEE SOFTWARE, 1999, 16 (04) : 8 - 11
  • [5] Study of State-of-the-art Open-source C/C++ Static Analysis Tools
    Li, Guang-Wei
    Yuan, Ting
    Li, Lian
    [J]. Ruan Jian Xue Bao/Journal of Software, 2022, 33 (06): : 2061 - 2081
  • [6] Certification of open-source software: A role for formal methods?
    Barbosa, Luis S.
    Cerone, Antonio
    Petrenko, Alexander K.
    Shaikh, Siraj A.
    [J]. COMPUTER SYSTEMS SCIENCE AND ENGINEERING, 2010, 25 (04): : 273 - 281
  • [7] EvoCluster: An Open-Source Nature-Inspired Optimization Clustering Framework
    Qaddoura R.
    Faris H.
    Aljarah I.
    Castillo P.A.
    [J]. SN Computer Science, 2021, 2 (3)
  • [8] VISDA: an open-source caBIG™ analytical tool for data clustering and beyond
    Wang, Jiajing
    Li, Huai
    Zhu, Yitan
    Yousef, Malik
    Nebozhyn, Michael
    Showe, Michael
    Showe, Louise
    Xuan, Jianhua
    Clarke, Robert
    Wang, Yue
    [J]. BIOINFORMATICS, 2007, 23 (15) : 2024 - 2027
  • [9] State-of-the-art performance in text-independent speaker verification through open-source software
    Fauve, Benoit G. B.
    Matrouf, Driss
    Scheffer, Nicolas
    Bonastre, Jean-Francois
    Mason, John S. D.
    [J]. IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2007, 15 (07): : 1960 - 1968
  • [10] Building an open-source environmental monitoring system - A review of state-of-the-art and directions for future research
    Sudantha, B. H.
    Warusavitharana, E. J.
    Ratnayake, G. R.
    Mahanama, P. K. S.
    Cannata, M.
    Strigaro, D.
    [J]. 2018 3RD INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY RESEARCH (ICITR), 2018,