DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets

Authors
Russo, Elena Tea [1, 2]
Barone, Federico [1, 2, 3]
Bateman, Alex [4]
Cozzini, Stefano [2]
Punta, Marco [5, 6]
Laio, Alessandro [1, 7]
Affiliations
[1] SISSA, Trieste, Italy
[2] Area Sci Pk, Trieste, Italy
[3] Univ Trieste, Dept Math & Geosci, Trieste, Italy
[4] European Bioinformat Inst EBI, European Mol Biol Lab EMBL, Wellcome Genome Campus, Hinxton, England
[5] IRCCS San Raffaele Hosp, Ctr Omics Sci, Milan, Italy
[6] IRCCS San Raffaele Sci Inst, Div Immunol Transplantat & Infect Dis, Unit Immunogenet, Leukemia Genom & Immunobiol, Milan, Italy
[7] Abdus Salaam Int Ctr Theoret Phys, Trieste, Italy
Source
PLOS ONE | 2022 / Volume 17 / Issue 10
DOI
Not available
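Chinese Library Classification
O [Mathematical and physical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]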
Discipline classification codes
07; 0710; 09
Abstract
Proteins that are known only at a sequence level outnumber those with an experimental characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for yet unannotated sequences. Existing domain family resources typically use at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified. We here describe automatic clustering by Density Peak Clustering of UniRef50 v. 2017_07, a protein sequence database including approximately 23M sequences. We performed a radical re-implementation of a pipeline we previously developed in order to allow handling millions of sequences and data volumes of the order of 3 TeraBytes. The modified pipeline, which we call DPCfam, finds approximately 45,000 protein clusters in UniRef50. Our automatic classification is in close correspondence to the ones of the Pfam and ECOD resources: in particular, about 81% of medium-large Pfam families and 80% of ECOD families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters constituted of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. These results are made available to the scientific community through a dedicated repository.
Pages: 29