Clustering huge protein sequence sets in linear time

Authors
Martin Steinegger
Johannes Söding
Affiliations
[1] Max-Planck Institute for Biophysical Chemistry, Quantitative and Computational Biology group
[2] Technische Universität München, Department for Bioinformatics and Computational Biology
[3] Seoul National University, Department of Chemistry
Abstract
Metagenomic datasets contain billions of protein sequences that could greatly enhance large-scale functional annotation and structure prediction. Utilizing this enormous resource would require reducing its redundancy by similarity clustering. However, clustering hundreds of millions of sequences is impractical using current algorithms because their runtimes scale as the input set size N times the number of clusters K, which is typically of similar order as N, resulting in runtimes that increase almost quadratically with N. We developed Linclust, the first clustering algorithm whose runtime scales as N, independent of K. It can also cluster datasets several times larger than the available main memory. We cluster 1.6 billion metagenomic sequence fragments in 10 h on a single server to 50% sequence identity, >1000 times faster than has been possible before. Linclust will help to unlock the great wealth contained in metagenomic and genomic sequence databases.
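The scaling argument in the abstract can be illustrated with a toy cost model. The sketch below is not Linclust and not any real clustering tool: the function names `greedy_comparisons` and `linear_comparisons` are hypothetical, and it assumes a worst case in which every incoming sequence is compared against all existing cluster representatives, with every second sequence founding a new cluster (so K is about N/2). Under those assumptions, doubling N roughly quadruples the work of the N times K approach, while a method whose cost per sequence is constant only doubles.

```python
def greedy_comparisons(n_sequences, new_cluster_every=2):
    """Toy cost model of greedy incremental clustering (an assumption,
    not the method of the paper): each incoming sequence is compared
    against every existing representative, and every
    `new_cluster_every`-th sequence founds a new cluster."""
    representatives = 0
    comparisons = 0
    for i in range(n_sequences):
        # Worst case: compare against all representatives found so far.
        comparisons += representatives
        # Hypothetical rule that keeps K proportional to N.
        if i % new_cluster_every == 0:
            representatives += 1
    return comparisons, representatives


def linear_comparisons(n_sequences, per_sequence=1):
    """Cost of a method that does a fixed amount of work per sequence,
    independent of the number of clusters K."""
    return n_sequences * per_sequence


if __name__ == "__main__":
    for n in (1000, 2000, 4000):
        quadratic_cost, k = greedy_comparisons(n)
        print(n, k, quadratic_cost, linear_comparisons(n))
```

The model counts comparisons only; it ignores the cost of each comparison and any prefiltering that real tools apply, so it shows the growth rate rather than actual runtimes.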