SCRAPT: an iterative algorithm for clustering large 16S rRNA gene data sets

Cited by: 1
Authors
Luan, Tu [1 ,2 ]
Muralidharan, Harihara Subrahmaniam [1 ,2 ]
Alshehri, Marwan [1 ]
Mittra, Ipsa [1 ]
Pop, Mihai [1 ,2 ]
Affiliations
[1] Univ Maryland, Dept Comp Sci, College Pk, MD 20742 USA
[2] Univ Maryland, Ctr Bioinformat & Computat Biol, College Pk, MD 20742 USA
Funding
US National Institutes of Health;
Keywords
IDENTIFICATION; INFERENCE; CATALOG; EST;
DOI
10.1093/nar/gkad158
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
16S rRNA gene sequence clustering is an important tool in characterizing the diversity of microbial communities. As 16S rRNA gene data sets are growing in size, existing sequence clustering algorithms increasingly become an analytical bottleneck. Part of this bottleneck is due to the substantial computational cost expended on small clusters and singleton sequences. We propose an iterative sampling-based 16S rRNA gene sequence clustering approach that targets the largest clusters in the data set, allowing users to stop the clustering process when sufficient clusters are available for the specific analysis being targeted. We describe a probabilistic analysis of the iterative clustering process that supports the intuition that the clustering process identifies the larger clusters in the data set first. Using real data sets of 16S rRNA gene sequences, we show that the iterative algorithm, coupled with an adaptive sampling process and a mode-shifting strategy for identifying cluster representatives, substantially speeds up the clustering process while being effective at capturing the large clusters in the data set. The experiments also show that SCRAPT (Sample, Cluster, Recruit, AdaPt and iTerate) is able to produce operational taxonomic units that are less fragmented than popular tools: UCLUST, CD-HIT and DNACLUST. The algorithm is implemented in the open-source package SCRAPT. The source code used to generate the results presented in this paper is available at.
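The sample, cluster, recruit, adapt and iterate loop described in the abstract can be illustrated with a minimal Python sketch. This is not the SCRAPT implementation. It assumes a toy position-wise identity score in place of alignment-based similarity, a simple greedy clustering of the sample, and a hypothetical adaptation rule that doubles the sampling rate when an iteration recruits nothing. All function names, parameters and defaults (`identity`, `greedy_cluster`, `iterative_cluster`, `threshold`, `rate`, `min_size`) are illustrative, and the mode-shifting step for choosing representatives is omitted.

```python
import random


def identity(a, b):
    """Fraction of matching positions (toy stand-in for alignment-based similarity)."""
    n = max(len(a), len(b))
    if n == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / n


def greedy_cluster(seqs, threshold):
    """Greedy clustering: each sequence joins the first centre it matches."""
    clusters = []  # list of (centre, members)
    for s in seqs:
        for centre, members in clusters:
            if identity(centre, s) >= threshold:
                members.append(s)
                break
        else:
            clusters.append((s, [s]))
    return clusters


def iterative_cluster(seqs, threshold=0.9, rate=0.1, min_size=2,
                      max_iters=10, seed=0):
    """Return (clusters, unclustered) after at most max_iters iterations."""
    rng = random.Random(seed)
    remaining = list(seqs)
    result = []  # list of (representative, members)
    for _ in range(max_iters):
        if not remaining:
            break
        # Sample: large clusters are the most likely to appear in a small sample.
        k = max(1, int(rate * len(remaining)))
        sample = rng.sample(remaining, k)
        # Cluster the sample; keep centres of clusters seen at least min_size times.
        centres = [c for c, m in greedy_cluster(sample, threshold)
                   if len(m) >= min_size]
        # Recruit: pull every unclustered sequence close to a kept centre.
        recruited = 0
        for c in centres:
            members = [s for s in remaining if identity(c, s) >= threshold]
            if members:
                result.append((c, members))
                remaining = [s for s in remaining if identity(c, s) < threshold]
                recruited += len(members)
        # Adapt (hypothetical rule): sample more aggressively if nothing was found.
        if recruited == 0:
            rate = min(1.0, rate * 2)
    return result, remaining
```

Because the loop can be stopped after any iteration, a caller interested only in the dominant clusters can set a small `max_iters` and treat whatever is left in the second return value as unclustered.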
Pages: 12