DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets

Authors
Russo, Elena Tea [1, 2]
Barone, Federico [1, 2, 3]
Bateman, Alex [4]
Cozzini, Stefano [2]
Punta, Marco [5, 6]
Laio, Alessandro [1, 7]
Affiliations
[1] SISSA, Trieste, Italy
[2] Area Sci Pk, Trieste, Italy
[3] Univ Trieste, Dept Math & Geosci, Trieste, Italy
[4] European Bioinformat Inst EBI, European Mol Biol Lab EMBL, Wellcome Genome Campus, Hinxton, England
[5] IRCCS San Raffaele Hosp, Ctr Omics Sci, Milan, Italy
[6] IRCCS San Raffaele Sci Inst, Div Immunol Transplantat & Infect Dis, Unit Immunogenet, Leukemia Genom & Immunobiol, Milan, Italy
[7] Abdus Salam Int Ctr Theoret Phys, Trieste, Italy
Source
PLOS ONE | 2022 / Vol. 17 / Issue 10
DOI
Not available
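Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09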
Abstract
Proteins that are known only at a sequence level outnumber those with an experimental characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for yet unannotated sequences. Existing domain family resources typically use at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified. We here describe automatic clustering by Density Peak Clustering of UniRef50 v. 2017_07, a protein sequence database including approximately 23M sequences. We performed a radical re-implementation of a pipeline we previously developed in order to allow handling millions of sequences and data volumes of the order of 3 TeraBytes. The modified pipeline, which we call DPCfam, finds approximately 45,000 protein clusters in UniRef50. Our automatic classification is in close correspondence to the ones of the Pfam and ECOD resources: in particular, about 81% of medium-large Pfam families and 80% of ECOD families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters constituted of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. These results are made available to the scientific community through a dedicated repository.
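The abstract names Density Peak Clustering without describing it. The following is a minimal sketch of the generic idea on toy 2D points, not the DPCfam pipeline itself: each point gets a local density (number of neighbours within a cutoff) and a distance to the nearest point of higher density; points where both are large become cluster centers, and every other point inherits the label of its nearest higher-density neighbour. The function name, the cutoff `d_c`, the fixed `n_clusters`, and the use of the product of density and distance to pick centers are all illustrative assumptions; DPCfam works on sequence-alignment data with its own distance and is not reproduced here.

```python
import math

def density_peak_clustering(points, d_c, n_clusters):
    """Toy Density Peak Clustering on points in Euclidean space (hypothetical helper)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density: number of other points strictly closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c) for i in range(n)]

    # Visit points by decreasing density; ties are broken by index so that
    # "higher density" is always well defined.
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    delta = [0.0] * n            # distance to the nearest higher-density point
    nearest_higher = [-1] * n    # index of that point
    for rank, i in enumerate(order):
        if rank == 0:
            # The densest point has no higher-density neighbour;
            # by convention it gets the largest distance to any point.
            delta[i] = max(dist[i])
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        nearest_higher[i] = j

    # Assumption: pick centers as the n_clusters points with the largest rho * delta.
    centers = sorted(range(n), key=lambda i: rho[i] * delta[i], reverse=True)[:n_clusters]

    labels = [-1] * n
    for cluster_id, c in enumerate(centers):
        labels[c] = cluster_id

    # Every remaining point takes the label of its nearest higher-density
    # neighbour, which has already been labelled because of the visiting order.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_higher[i]]
    return labels

# Two well-separated toy groups, each a unit square plus its midpoint.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
labels = density_peak_clustering(points, d_c=1.0, n_clusters=2)
print(labels)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

In this toy run the two square midpoints have the highest density and are far from each other, so they are selected as the two centers. The all-pairs distance matrix used here is quadratic in the number of points and would not scale to a dataset of millions of sequences.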
Pages: 29