SCRAPT: an iterative algorithm for clustering large 16S rRNA gene data sets

Cited by: 1
Authors
Luan, Tu [1, 2]
Muralidharan, Harihara Subrahmaniam [1, 2]
Alshehri, Marwan [1]
Mittra, Ipsa [1]
Pop, Mihai [1, 2]
Affiliations
[1] Univ Maryland, Dept Comp Sci, College Pk, MD 20742 USA
[2] Univ Maryland, Ctr Bioinformat & Computat Biol, College Pk, MD 20742 USA
Funding
National Institutes of Health (USA)
Keywords
IDENTIFICATION; INFERENCE; CATALOG; EST;
DOI
10.1093/nar/gkad158
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Subject classification codes
071010 ; 081704 ;
Abstract
16S rRNA gene sequence clustering is an important tool in characterizing the diversity of microbial communities. As 16S rRNA gene data sets are growing in size, existing sequence clustering algorithms increasingly become an analytical bottleneck. Part of this bottleneck is due to the substantial computational cost expended on small clusters and singleton sequences. We propose an iterative sampling-based 16S rRNA gene sequence clustering approach that targets the largest clusters in the data set, allowing users to stop the clustering process when sufficient clusters are available for the specific analysis being targeted. We describe a probabilistic analysis of the iterative clustering process that supports the intuition that the clustering process identifies the larger clusters in the data set first. Using real data sets of 16S rRNA gene sequences, we show that the iterative algorithm, coupled with an adaptive sampling process and a mode-shifting strategy for identifying cluster representatives, substantially speeds up the clustering process while being effective at capturing the large clusters in the data set. The experiments also show that SCRAPT (Sample, Cluster, Recruit, AdaPt and iTerate) is able to produce operational taxonomic units that are less fragmented than popular tools: UCLUST, CD-HIT and DNACLUST. The algorithm is implemented in the open-source package SCRAPT. The source code used to generate the results presented in this paper is available at.
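The sample, cluster, recruit and iterate loop outlined in the abstract can be illustrated with a toy sketch. This is not the SCRAPT implementation. The function names, the position-wise similarity on equal-length strings, the fixed sampling rate and the greedy clustering of the sample are all assumptions made for brevity, and the adaptive sampling and mode-shifting steps are left out.

```python
import random


def similarity(a, b):
    """Fraction of matching positions (assumes equal-length sequences)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_centers(sample, threshold):
    """Pick cluster centers greedily: a sequence becomes a new center
    only if it is not similar enough to any existing center."""
    centers = []
    for s in sample:
        if not any(similarity(s, c) >= threshold for c in centers):
            centers.append(s)
    return centers


def iterative_cluster(seqs, threshold=0.9, sample_rate=0.1,
                      max_iters=10, seed=0):
    """Hypothetical sample -> cluster -> recruit -> iterate loop.

    Returns (clusters, remaining), where clusters is a list of
    (center, members) pairs and remaining holds the sequences that
    were still unclustered when the loop stopped.
    """
    rng = random.Random(seed)
    remaining = list(seqs)
    clusters = []
    for _ in range(max_iters):
        if not remaining:
            break
        # Sample: a uniform sample is more likely to hit large clusters.
        k = max(1, int(len(remaining) * sample_rate))
        sample = rng.sample(remaining, k)
        # Cluster: only the small sample is clustered.
        centers = greedy_centers(sample, threshold)
        # Recruit: assign every remaining sequence to the first
        # center it matches; the rest wait for the next iteration.
        members = [[] for _ in centers]
        unassigned = []
        for s in remaining:
            for i, c in enumerate(centers):
                if similarity(s, c) >= threshold:
                    members[i].append(s)
                    break
            else:
                unassigned.append(s)
        clusters.extend(zip(centers, members))
        remaining = unassigned
    return clusters, remaining
```

Calling `iterative_cluster(seqs, max_iters=1)` stops after a single round, which mirrors the idea of halting once enough clusters are available; whatever is left in `remaining` would mostly be small clusters and singletons.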
Pages: 12