On the quality of tree-based protein classification

Cited by: 12
Authors
Lazareva-Ulitsky, B [1]
Diemer, K [1]
Thomas, PD [1]
Affiliations
[1] Appl Biosyst Inc, Computat Biol Dept, Foster City, CA 94404 USA
DOI
10.1093/bioinformatics/bti244
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Phylogenetic analysis of protein sequences is widely used in protein function classification and in delineating subfamilies within larger families. In addition, the recent increase in the number of protein sequence entries annotated with controlled vocabulary terms describing function (e.g. the Gene Ontology) suggests that it may be possible to overlay these terms onto phylogenetic trees to automatically locate functional divergence events in protein family evolution. Phylogenetic analysis of large datasets requires fast algorithms, yet even 'fast', approximate distance matrix-based phylogenetic algorithms are slow on large datasets because they involve calculating maximum likelihood estimates of pairwise evolutionary distances. There have been many attempts to classify protein sequences at the family and subfamily level without reconstructing phylogenetic trees, using instead hierarchical clustering with simpler distance measures, which also produces trees or dendrograms. How can these trees be compared in their ability to accurately classify protein sequences?

Results: Given a 'reference classification' or 'group membership labels' for a set of related protein sequences, as well as a tree describing their relationships (e.g. a phylogenetic tree), we propose a method for dividing the tree into monophyletic or paraphyletic groups so as to optimize the correspondence between the reference groups and the tree-derived groups. We call the achieved optimal correspondence the 'accuracy of a tree-based classification (TBC)'; it measures the ability of a tree to separate proteins of similar function into monophyletic or paraphyletic groups. We apply this measure to compare classical NJ and UPGMA phylogenetic trees with the trees obtained from hierarchical clustering using different protein similarity measures.

Our preliminary analysis on a set of expert-curated protein families and alignments suggests that there is no uniformly superior algorithm, and that simple protein similarity measures combined with hierarchical clustering produce trees with reasonable, and often the most accurate, TBC. We used our measure to help us design TIPS, a tree-building algorithm based on agglomerative clustering with a similarity measure derived from profile scoring. TIPS is comparable with phylogenetic algorithms in terms of classification accuracy and is much faster on large protein families. Owing to its time scalability and acceptable accuracy, TIPS is being used in the large-scale PANTHER protein classification project. The trees produced by different algorithms for different protein families can be viewed at http://panther.appliedbiosystems.com/pub/tree_quality/trees.jsp. For every tree and every level of classification granularity we provide the optimal TBC along with the reference classification.
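The idea of dividing a tree into groups and scoring them against a reference classification can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not specify the correspondence measure or the optimization procedure, so the sketch assumes a pairwise Jaccard index as the score, considers monophyletic groups (clades) only, and simply enumerates every possible cut, which is feasible only for tiny trees. The nested-tuple tree encoding and all function names (`leaves`, `clade_partitions`, `pair_jaccard`, `best_partition`) are hypothetical.

```python
from itertools import combinations, product


def leaves(node):
    """Return the leaf names under a node (a leaf is a str, an internal node a tuple)."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]


def clade_partitions(node):
    """Yield every division of the subtree's leaves into monophyletic groups."""
    yield [leaves(node)]  # option 1: keep the whole subtree as a single group
    if isinstance(node, str):
        return
    # option 2: cut at this node and combine independent divisions of each child
    for combo in product(*(list(clade_partitions(child)) for child in node)):
        yield [group for part in combo for group in part]


def pair_jaccard(groups, reference):
    """Assumed score: Jaccard index over leaf pairs grouped together
    by the tree-derived groups versus by the reference labels."""
    tree_group = {leaf: i for i, group in enumerate(groups) for leaf in group}
    both = either = 0
    for a, b in combinations(sorted(tree_group), 2):
        same_tree = tree_group[a] == tree_group[b]
        same_ref = reference[a] == reference[b]
        both += int(same_tree and same_ref)
        either += int(same_tree or same_ref)
    return both / either if either else 1.0


def best_partition(tree, reference):
    """Exhaustively pick the clade division that best matches the reference."""
    scored = ((pair_jaccard(p, reference), p) for p in clade_partitions(tree))
    return max(scored, key=lambda item: item[0])


# Toy data (invented for illustration): five sequences, two reference groups.
tree = ((("a1", "a2"), "a3"), ("b1", "b2"))
reference = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
score, groups = best_partition(tree, reference)
print(score, groups)
```

In this toy case the tree separates the two reference groups cleanly, so the best division reaches the maximum score. A tree whose clades mix reference groups cannot reach it under any cut, which is the sense in which such a score could be used to compare trees built by different algorithms.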
Pages: 1876-1890
Page count: 15