SCRAPT: an iterative algorithm for clustering large 16S rRNA gene data sets

Cited by: 1
Authors
Luan, Tu [1, 2]
Muralidharan, Harihara Subrahmaniam [1, 2]
Alshehri, Marwan [1]
Mittra, Ipsa [1]
Pop, Mihai [1, 2]
Affiliations
[1] Univ Maryland, Dept Comp Sci, College Pk, MD 20742 USA
[2] Univ Maryland, Ctr Bioinformat & Computat Biol, College Pk, MD 20742 USA
Funding
National Institutes of Health (USA);
Keywords
IDENTIFICATION; INFERENCE; CATALOG; EST;
DOI
10.1093/nar/gkad158
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology];
Subject classification codes
071010 ; 081704 ;
Abstract
16S rRNA gene sequence clustering is an important tool in characterizing the diversity of microbial communities. As 16S rRNA gene data sets grow in size, existing sequence clustering algorithms increasingly become an analytical bottleneck. Part of this bottleneck is due to the substantial computational cost expended on small clusters and singleton sequences. We propose an iterative sampling-based 16S rRNA gene sequence clustering approach that targets the largest clusters in the data set, allowing users to stop the clustering process once enough clusters are available for the analysis at hand. We describe a probabilistic analysis of the iterative clustering process that supports the intuition that the process identifies the larger clusters in the data set first. Using real data sets of 16S rRNA gene sequences, we show that the iterative algorithm, coupled with an adaptive sampling process and a mode-shifting strategy for identifying cluster representatives, substantially speeds up the clustering process while remaining effective at capturing the large clusters in the data set. The experiments also show that SCRAPT (Sample, Cluster, Recruit, AdaPt and iTerate) is able to produce operational taxonomic units that are less fragmented than those produced by popular tools: UCLUST, CD-HIT and DNACLUST. The algorithm is implemented in the open-source package SCRAPT. The source code used to generate the results presented in this paper is available at.
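The sample, cluster, recruit and iterate loop described in the abstract can be illustrated with a minimal Python sketch. This is not the SCRAPT implementation: the function names, the parameters, the position-wise identity measure on equal-length strings, and the greedy centre-based clustering are all assumptions made for illustration, and the adaptive sampling and mode-shifting steps are omitted because the abstract does not describe them in detail.

```python
import random


def identity(a, b):
    """Fraction of matching positions (assumes equal-length, non-empty strings)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Greedy centre-based clustering: a sequence joins the first centre it
    matches at >= threshold identity, otherwise it becomes a new centre."""
    centres, members = [], []
    for s in seqs:
        for i, c in enumerate(centres):
            if identity(s, c) >= threshold:
                members[i].append(s)
                break
        else:
            centres.append(s)
            members.append([s])
    return centres, members


def iterative_cluster(seqs, threshold=0.9, sample_frac=0.1,
                      max_iters=10, min_cluster_size=2, seed=0):
    """Hypothetical sample -> cluster -> recruit -> iterate loop.

    Returns (clusters, remaining), where `remaining` holds the sequences
    not yet assigned when the loop stops.
    """
    rng = random.Random(seed)
    remaining = list(seqs)
    clusters = []
    for _ in range(max_iters):
        if not remaining:
            break
        # Sample: draw a random subset of the unclustered sequences.
        k = max(1, int(sample_frac * len(remaining)))
        sample = rng.sample(remaining, k)
        # Cluster: cluster only the sample, which is cheap.
        centres, sample_members = greedy_cluster(sample, threshold)
        # Keep centres backed by several sampled sequences; large clusters
        # are the ones most likely to be hit more than once by the sample.
        kept = [c for c, m in zip(centres, sample_members)
                if len(m) >= min_cluster_size]
        if not kept:
            continue
        # Recruit: assign every unclustered sequence to the first kept centre
        # it matches; the rest carry over to the next iteration.
        recruited = [[] for _ in kept]
        leftover = []
        for s in remaining:
            for i, c in enumerate(kept):
                if identity(s, c) >= threshold:
                    recruited[i].append(s)
                    break
            else:
                leftover.append(s)
        clusters.extend(recruited)
        remaining = leftover
    return clusters, remaining
```

Calling `iterative_cluster(seqs, sample_frac=0.2)` on a list of equal-length strings returns the clusters found so far together with the sequences still unassigned. Because the caller can stop after any iteration, small clusters and singletons are simply left in `remaining` rather than consuming clustering effort.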
Pages: 12