DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets

Authors
Russo, Elena Tea [1, 2]
Barone, Federico [1, 2, 3]
Bateman, Alex [4]
Cozzini, Stefano [2]
Punta, Marco [5, 6]
Laio, Alessandro [1, 7]
Affiliations
[1] SISSA, Trieste, Italy
[2] Area Sci Pk, Trieste, Italy
[3] Univ Trieste, Dept Math & Geosci, Trieste, Italy
[4] European Bioinformat Inst EBI, European Mol Biol Lab EMBL, Wellcome Genome Campus, Hinxton, England
[5] IRCCS San Raffaele Hosp, Ctr Omics Sci, Milan, Italy
[6] IRCCS San Raffaele Sci Inst, Div Immunol Transplantat & Infect Dis, Unit Immunogenet, Leukemia Genom & Immunobiol, Milan, Italy
[7] Abdus Salam Int Ctr Theoret Phys, Trieste, Italy
Source
PLOS ONE | 2022 / Vol. 17 / Issue 10
Keywords
DOI
Not available
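Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09
Abstract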
Proteins that are known only at the sequence level outnumber those with an experimental characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for as yet unannotated sequences. Existing domain family resources typically rely on at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified. Here we describe the automatic clustering, by Density Peak Clustering, of UniRef50 v. 2017_07, a protein sequence database that includes approximately 23M sequences. We radically re-implemented a pipeline we had previously developed so that it can handle millions of sequences and data volumes of the order of 3 terabytes. The modified pipeline, which we call DPCfam, finds ~45,000 protein clusters in UniRef50. Our automatic classification corresponds closely to those of the Pfam and ECOD resources: in particular, about 81% of medium-large Pfam families and 80% of ECOD families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters consisting of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. These results are made available to the scientific community through a dedicated repository.
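For readers unfamiliar with the clustering method named in the abstract, the following is a minimal sketch of generic Density Peak Clustering on a toy set of 2D points: cluster centres are points with a high local density that lie far from any denser point, and every other point inherits the label of its nearest denser neighbour. It is not the DPCfam pipeline, which clusters protein regions at a much larger scale. The function name `density_peak_clustering`, the Euclidean distance, the cutoff `d_c`, and the rule of picking a fixed number of centres by the product of density and distance are all illustrative assumptions.

```python
import numpy as np

def density_peak_clustering(points, d_c, n_clusters):
    """Toy Density Peak Clustering; all names and parameters are illustrative."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Pairwise Euclidean distances (assumption: the real pipeline uses
    # a distance between protein regions, not coordinates).
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Local density: number of other points closer than the cutoff d_c.
    rho = (dist < d_c).sum(axis=1) - 1
    # Visit points from highest to lowest density (ties broken by index).
    order = np.argsort(-rho, kind="stable")
    delta = np.zeros(n)                 # distance to nearest denser point
    nearest_higher = np.full(n, -1)     # index of that denser point
    for rank, i in enumerate(order):
        if rank == 0:
            # The global density maximum has no denser point; by convention
            # its delta is its largest distance to any other point.
            delta[i] = dist[i].max()
            continue
        higher = order[:rank]
        j = higher[np.argmin(dist[i, higher])]
        delta[i] = dist[i, j]
        nearest_higher[i] = j
    # Centres: points with the largest rho * delta.
    gamma = rho * delta
    gamma[order[0]] = np.inf            # the densest point is always a centre
    centers = np.argsort(-gamma, kind="stable")[:n_clusters]
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    # Remaining points take the label of their nearest denser neighbour,
    # which has already been labelled because of the visiting order.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_higher[i]]
    return labels, centers

# Two well-separated toy groups of points.
toy_points = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1),
              (5, 5), (5.1, 5), (5, 5.1)]
labels, centers = density_peak_clustering(toy_points, d_c=0.5, n_clusters=2)
print(labels.tolist(), sorted(centers.tolist()))
```

On this toy input the two groups are separated cleanly. In practice the number of centres is usually chosen by inspecting the density and distance values rather than fixed in advance.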
Pages: 29