Partitioning clustering algorithms for protein sequence data sets

Cited by: 6
Authors
Fayech, Sondes [1]
Essoussi, Nadia [1]
Limam, Mohamed [1]
Affiliations
[1] Univ Tunis, Dept Comp Sci, Higher Inst Management, LARODEC Lab, Tunis, Tunisia
Source
BIODATA MINING | 2009 / Vol. 2
Keywords
CLASSIFICATION; IMPROVEMENTS; SEARCH;
DOI
10.1186/1756-0381-2-3
Chinese Library Classification number
Q [Biological Sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract
Background: Genome-sequencing projects are producing an enormous number of new sequences, causing protein sequence databases to grow rapidly. The unsupervised classification of these data into functional groups or families (clustering) has become one of the principal research objectives in structural and functional genomics, and computer programs that classify sequences into families automatically and accurately have become a necessity. A significant number of methods have addressed the clustering of protein sequences, and most of them fall into three major groups: hierarchical, graph-based and partitioning methods. Of these, hierarchical and graph-based approaches have been widely used. Although partitioning clustering techniques are widely used in other fields, few applications have been reported for protein sequence clustering. It has not been fully demonstrated whether partitioning methods can be applied to protein sequence data, or whether they are efficient compared with published clustering methods.
Methods: We developed four partitioning clustering approaches that use the Smith-Waterman local-alignment algorithm to determine pairwise similarities between sequences. Four different sets of protein sequences were used as evaluation data sets for the proposed methods.
Results: We show that these methods outperform several other published clustering methods in terms of correctly predicting a classifier and, especially, in terms of the correctness of the provided prediction. The software is available to academic users from the authors upon request.
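The general idea in the Methods statement, partitioning sequences from Smith-Waterman pairwise similarities, can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the abstract does not name the four approaches, so a plain k-medoids loop is swapped in as one representative partitioning method, and the scoring parameters (`match`, `mismatch`, `gap`) and function names are hypothetical, not taken from the authors' software. A real application would likely use a substitution matrix such as BLOSUM62 and affine gap penalties.

```python
from typing import List, Sequence, Tuple


def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local-alignment score of a and b (toy scoring, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity_matrix(seqs: Sequence[str]) -> List[List[int]]:
    """Symmetric matrix of pairwise Smith-Waterman scores."""
    n = len(seqs)
    sim = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sim[i][j] = sim[j][i] = smith_waterman_score(seqs[i], seqs[j])
    return sim


def k_medoids(sim: List[List[int]], k: int,
              max_iter: int = 100) -> Tuple[List[int], List[int]]:
    """Alternating k-medoids on a similarity matrix (assumed stand-in method).

    Initial medoids are simply the first k sequences, so results depend on
    input order; this is a simplification, not a recommended initialisation.
    """
    n = len(sim)
    medoids = list(range(k))
    labels: List[int] = []
    for _ in range(max_iter):
        # Assign each sequence to the medoid it is most similar to.
        labels = [max(range(k), key=lambda c: sim[i][medoids[c]])
                  for i in range(n)]
        # Keep every medoid in its own cluster so no cluster becomes empty.
        for c, m in enumerate(medoids):
            labels[m] = c
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            # New medoid: member with the highest total similarity to its cluster.
            new_medoids.append(
                max(members, key=lambda m: sum(sim[m][i] for i in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels, medoids


if __name__ == "__main__":
    # Made-up toy sequences forming two obvious groups.
    toy = ["MKVLAAGIV", "MKVLAAGLV", "MKVLSAGIV",
           "PQRSTWYPQ", "PQRSTWYPE", "PQRSTFYPQ"]
    labels, medoids = k_medoids(similarity_matrix(toy), k=2)
    print(labels, medoids)
```

Computing all pairwise alignments is quadratic in the number of sequences, which is the main cost of this kind of approach; the sketch makes no attempt to reduce it.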
Pages: 11