SEQOPTICS: a protein sequence clustering system

Cited by: 14
Authors
Chen, Yonghui [1]
Reilly, Kevin D.
Sprague, Alan P.
Guan, Zhijie
Affiliations
[1] Univ Alabama Birmingham, Dept Comp & Informat Sci, Birmingham, AL 35294 USA
[2] Univ Calif San Diego, San Diego Supercomp Ctr, La Jolla, CA 92093 USA
DOI
10.1186/1471-2105-7-S4-S10
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Protein sequence clustering has been widely used as part of the analysis of protein structure and function. In most cases, single-linkage or graph-based clustering algorithms have been applied. OPTICS (Ordering Points To Identify the Clustering Structure) is an attractive approach because of its emphasis on visualization of results and its support for interactive work, e.g., in choosing parameters. However, as far as we know, OPTICS has not been used for protein sequence clustering.
Results: In this paper, a system for clustering proteins, SEQOPTICS (SEQuence clustering with OPTICS), is demonstrated. The system uses Smith-Waterman as its protein distance measurement and OPTICS at its core to perform protein sequence clustering. SEQOPTICS is tested with four data sets from different data sources. Visualization of the sequence clustering structure is demonstrated as well.
Conclusion: The system was evaluated by comparison with other existing methods. Analysis of the results demonstrates that SEQOPTICS performs better on some evaluation criteria, including the Jaccard coefficient, precision, and recall. It is a promising protein sequence clustering method, with possible future improvements through parallel computing and other protein distance measurements.
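The abstract names the Jaccard coefficient, precision, and recall as evaluation criteria but does not define them. The sketch below shows one common way to compute all three for a clustering, by counting pairs of items that share a cluster in the predicted labeling, in a reference labeling, or in both. This pair-counting formulation is an assumption, not necessarily the definition used in the paper, and the function names `pair_counts` and `pair_scores` and the toy labels are hypothetical.

```python
from itertools import combinations


def pair_counts(predicted, reference):
    """Count item pairs by co-membership in the two labelings.

    Returns (tp, fp, fn):
      tp: pairs grouped together in both labelings
      fp: pairs grouped together only in the predicted labeling
      fn: pairs grouped together only in the reference labeling
    """
    if len(predicted) != len(reference):
        raise ValueError("labelings must cover the same items")
    tp = fp = fn = 0
    for i, j in combinations(range(len(predicted)), 2):
        same_pred = predicted[i] == predicted[j]
        same_ref = reference[i] == reference[j]
        if same_pred and same_ref:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_ref:
            fn += 1
    return tp, fp, fn


def pair_scores(predicted, reference):
    """Pair-counting precision, recall, and Jaccard coefficient.

    Assumed definitions (not taken from the paper):
      precision = tp / (tp + fp)
      recall    = tp / (tp + fn)
      jaccard   = tp / (tp + fp + fn)
    A score is reported as 0.0 when its denominator is zero.
    """
    tp, fp, fn = pair_counts(predicted, reference)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "jaccard": jaccard}


# Toy example: six sequences, cluster ids from a hypothetical run
# compared against hypothetical reference family labels.
predicted_clusters = [0, 0, 0, 1, 1, 2]
reference_families = ["A", "A", "B", "B", "B", "C"]
scores = pair_scores(predicted_clusters, reference_families)
print(scores)
```

In the toy example, two pairs are grouped together in both labelings, two only in the predicted clusters, and two only in the reference families, so precision and recall are both 0.5 and the Jaccard coefficient is 1/3.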
Pages: 9