Clustering huge protein sequence sets in linear time

Authors:

Martin Steinegger

Johannes Söding

Affiliations:

[1] Max-Planck Institute for Biophysical Chemistry, Quantitative and Computational Biology group

[2] Technische Universität München, Department for Bioinformatics and Computational Biology

[3] Seoul National University, Department of Chemistry

Source:

Nature Communications, Volume 9
Abstract:

Metagenomic datasets contain billions of protein sequences that could greatly enhance large-scale functional annotation and structure prediction. Utilizing this enormous resource would require reducing its redundancy by similarity clustering. However, clustering hundreds of millions of sequences is impractical using current algorithms because their runtimes scale as the input set size N times the number of clusters K, which is typically of similar order as N, resulting in runtimes that increase almost quadratically with N. We developed Linclust, the first clustering algorithm whose runtime scales as N, independent of K. It can also cluster datasets several times larger than the available main memory. We cluster 1.6 billion metagenomic sequence fragments in 10 h on a single server to 50% sequence identity, >1000 times faster than has been possible before. Linclust will help to unlock the great wealth contained in metagenomic and genomic sequence databases.
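
The scaling argument in the abstract can be illustrated with a rough count of pairwise comparisons. This is a minimal sketch and not Linclust itself: it assumes a simplified model in which a conventional method compares each of the N sequences against all K cluster representatives, while a linear-time method performs a fixed number of comparisons per sequence. The function names and the values `cluster_fraction=0.5` and `per_sequence=20` are hypothetical, chosen only for illustration.

```python
def greedy_comparisons(n: int, cluster_fraction: float) -> int:
    """Comparisons when each of n sequences is checked against all k representatives.

    Assumption: k is a fixed fraction of n, so the count grows as n * k ~ n^2.
    """
    k = int(n * cluster_fraction)
    return n * k


def linear_comparisons(n: int, per_sequence: int) -> int:
    """Comparisons when each sequence triggers a constant number of checks."""
    return n * per_sequence


if __name__ == "__main__":
    # Hypothetical parameters; they are not taken from the paper.
    for n in (10**6, 10**7, 10**8):
        quadratic = greedy_comparisons(n, cluster_fraction=0.5)
        linear = linear_comparisons(n, per_sequence=20)
        print(f"N={n:>11,}  N*K={quadratic:.2e}  linear={linear:.2e}  "
              f"ratio={quadratic / linear:.1e}")
```

Under these assumptions, multiplying N by ten multiplies the N*K count by one hundred but the linear count only by ten, so the gap between the two widens in proportion to N.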