DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets

Cited by: 0

Authors:

Russo, Elena Tea ^{[1,2]}

Barone, Federico ^{[1,2,3]}

Bateman, Alex ^{[4]}

Cozzini, Stefano ^{[2]}

Punta, Marco ^{[5,6]}

Laio, Alessandro ^{[1,7]}

Affiliations:

[1] SISSA, Trieste, Italy

[2] Area Sci Pk, Trieste, Italy

[3] Univ Trieste, Dept Math & Geosci, Trieste, Italy

[4] European Bioinformat Inst EBI, European Mol Biol Lab EMBL, Wellcome Genome Campus, Hinxton, England

[5] IRCCS San Raffaele Hosp, Ctr Omics Sci, Milan, Italy

[6] IRCCS San Raffaele Sci Inst, Div Immunol Transplantat & Infect Dis, Unit Immunogenet, Leukemia Genom & Immunobiol, Milan, Italy

[7] Abdus Salam Int Ctr Theoret Phys, Trieste, Italy

Source:

PLOS ONE | 2022 / Vol. 17 / Issue 10

Keywords:

DOI:

Not available

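Chinese Library Classification (CLC) number:

O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];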

Discipline classification code:

07 ; 0710 ; 09 ;

Abstract:

Proteins that are known only at a sequence level outnumber those with an experimental characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for yet unannotated sequences. Existing domain family resources typically use at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified. We here describe automatic clustering by Density Peak Clustering of UniRef50 v. 2017_07, a protein sequence database including approximately 23M sequences. We performed a radical re-implementation of a pipeline we previously developed in order to allow handling millions of sequences and data volumes of the order of 3 TeraBytes. The modified pipeline, which we call DPCfam, finds approximately 45,000 protein clusters in UniRef50. Our automatic classification is in close correspondence to the ones of the Pfam and ECOD resources: in particular, about 81% of medium-large Pfam families and 80% of ECOD families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters constituted of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. These results are made available to the scientific community through a dedicated repository.

Pages: 29

50 items in total

[21] Kpax3: Bayesian bi-clustering of large sequence datasets
Pessia, Alberto
Corander, Jukka
BIOINFORMATICS, 2018, 34 (12) : 2132 - 2133
[22] Develop and implement unsupervised learning through hybrid FFPA clustering in large-scale datasets
Somase, Kiran Pandurang
Imambi, S. Sagar
SOFT COMPUTING, 2021, 25 (01) : 277 - 290
[23] Develop and implement unsupervised learning through hybrid FFPA clustering in large-scale datasets
Kiran Pandurang Somase
S. Sagar Imambi
Soft Computing, 2021, 25 : 277 - 290
[24] Fast density peak clustering for large scale data based on kNN
Chen, Yewang
Hu, Xiaoliang
Fan, Wentao
Shen, Lianlian
Zhang, Zheng
Liu, Xin
Du, Jixiang
Li, Haibo
Chen, Yi
Li, Hailin
KNOWLEDGE-BASED SYSTEMS, 2020, 187
[25] An improved density biased sampling algorithm for clustering large-scale datasets
Sheng, K. (shengkaiyuan1991@163.com), 1600, Binary Information Press (11):
[26] Large scale protein sequence clustering - Not solved but solvable
Krause, Antje
CURRENT BIOINFORMATICS, 2006, 1 (02) : 247 - 254
[27] An Unsupervised Anomaly Detection Method Based on Density Peak Clustering for Rail Vehicle Door System
Shi, Wen
Lu, Ningyun
Jiang, Bin
Zhi, Youran
Xu, Zhixing
PROCEEDINGS OF THE 2019 31ST CHINESE CONTROL AND DECISION CONFERENCE (CCDC 2019), 2019, : 1954 - 1959
[28] Protein family classification using structural and sequence information
Smith, SF
PROCEEDINGS OF THE 2004 IEEE SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY, 2004, : 168 - 174
[29] An Efficient Density Biased Sampling Algorithm for Clustering Large High-Dimensional Datasets
Qian, Xue-Zhong
Deng, Jie
INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2015, 29 (08)
[30] kClust: fast and sensitive clustering of large protein sequence databases
Maria Hauser
Christian E Mayer
Johannes Söding
BMC Bioinformatics, 14
