DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets

Authors
Russo, Elena Tea [1, 2]
Barone, Federico [1, 2, 3]
Bateman, Alex [4]
Cozzini, Stefano [2]
Punta, Marco [5, 6]
Laio, Alessandro [1, 7]
Affiliations
[1] SISSA, Trieste, Italy
[2] Area Sci Pk, Trieste, Italy
[3] Univ Trieste, Dept Math & Geosci, Trieste, Italy
[4] European Bioinformat Inst EBI, European Mol Biol Lab EMBL, Wellcome Genome Campus, Hinxton, England
[5] IRCCS San Raffaele Hosp, Ctr Omics Sci, Milan, Italy
[6] IRCCS San Raffaele Sci Inst, Div Immunol Transplantat & Infect Dis, Unit Immunogenet, Leukemia Genom & Immunobiol, Milan, Italy
[7] Abdus Salam Int Ctr Theoret Phys, Trieste, Italy
Source
PLOS ONE | 2022 / Volume 17 / Issue 10
DOI
Not available
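Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09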
Abstract
Proteins that are known only at the sequence level outnumber those with an experimental characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for as yet unannotated sequences. Existing domain family resources typically rely on at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified. Here we describe the automatic clustering, by Density Peak Clustering, of UniRef50 v. 2017_07, a protein sequence database including approximately 23M sequences. We performed a radical re-implementation of a pipeline we previously developed, in order to handle millions of sequences and data volumes of the order of 3 terabytes. The modified pipeline, which we call DPCfam, finds approximately 45,000 protein clusters in UniRef50. Our automatic classification corresponds closely to those of the Pfam and ECOD resources: in particular, about 81% of medium-large Pfam families and 80% of ECOD families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters composed of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. These results are made available to the scientific community through a dedicated repository.
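The abstract names Density Peak Clustering as the core of the pipeline without describing it. As a rough illustration only, the generic idea can be sketched as follows: estimate a local density for each point, measure each point's distance to its nearest neighbor of higher density, take points where both are large as cluster centers, and let every other point inherit the label of its nearest higher-density neighbor. This is not the DPCfam implementation. The function name `density_peak_clustering`, the Gaussian density kernel, the cutoff `d_c`, the fixed `n_clusters`, and the use of Euclidean distance on toy 2D points (instead of distances derived from sequence alignments) are all assumptions made for the example.

```python
import math


def density_peak_clustering(points, d_c, n_clusters):
    """Toy density peak clustering on small point sets (O(n^2) memory).

    points     : list of equal-length coordinate tuples (hypothetical input)
    d_c        : kernel width used for the local density (assumed parameter)
    n_clusters : number of centers to pick (assumed to be known in advance)
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density rho_i: Gaussian-weighted count of neighbors.
    rho = [
        sum(math.exp(-((dist[i][j] / d_c) ** 2)) for j in range(n) if j != i)
        for i in range(n)
    ]

    # Visit points from highest to lowest density.
    order = sorted(range(n), key=lambda i: -rho[i])

    # delta_i: distance to the nearest point that comes earlier in `order`
    # (i.e. has higher density); parent_i: that point's index.
    delta = [0.0] * n
    parent = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])  # convention for the densest point
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        parent[i] = j

    # Centers: largest rho * delta. The densest point is always a center.
    gamma = [rho[i] * delta[i] for i in range(n)]
    gamma[order[0]] = math.inf
    centers = sorted(range(n), key=lambda i: -gamma[i])[:n_clusters]

    labels = [-1] * n
    for cluster_id, c in enumerate(centers):
        labels[c] = cluster_id

    # Each remaining point inherits the label of its higher-density parent,
    # which has already been labeled because of the visiting order.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels
```

On two well-separated groups of points, calling `density_peak_clustering(points, d_c=1.0, n_clusters=2)` should place each group in its own cluster. The all-pairs distance matrix used here would not scale to millions of sequences, which is presumably part of what the re-implementation mentioned in the abstract addresses.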
Pages: 29