A large-scale assessment of sequence database search tools for homology-based protein function prediction

Cited by: 4
Authors
Zhang, Chengxin [1 ]
Freddolino, Lydia [1 ]
Affiliations
[1] Univ Michigan, Dept Computat Med & Bioinformat, Dept Biol Chem, 100 Washtenaw Ave, Ann Arbor, MI 48109 USA
Keywords
Gene Ontology; protein function prediction; sequence database search; BLASTp; DIAMOND; MMseqs2; ANNOTATION; GENERATION; ZEBRAFISH
DOI
10.1093/bib/bbae349
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Sequence database searches followed by homology-based function transfer form one of the oldest and most popular approaches for predicting protein functions, such as Gene Ontology (GO) terms. These searches are also a critical component of most state-of-the-art machine learning and deep learning-based protein function predictors. Although sequence search tools are the basis of homology-based protein function prediction, previous studies have scarcely explored how to select the optimal sequence search tool and configure its parameters to achieve the best function prediction. In this paper, we evaluate how the choice among popular search tools, and the choice of search parameters, affects protein function prediction. When predicting GO terms on a large benchmark dataset, we found that, under default search parameters, BLASTp and MMseqs2 consistently outperform other tools, including DIAMOND, one of the most popular tools for function prediction. However, with the correct parameter settings, DIAMOND can perform comparably to BLASTp and MMseqs2 in function prediction. Additionally, we developed a new scoring function for deriving GO predictions from homologous hits that consistently outperforms previously proposed scoring functions. These findings enable the improvement of almost all protein function prediction algorithms with a few easily implementable changes to their sequence homolog-based component. This study emphasizes the critical role of search parameter settings in homology-based function transfer and should contribute to the development of future protein function prediction algorithms.
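As a rough illustration of what "homology-based function transfer" means in practice, the sketch below scores each GO term by the share of total bit-score contributed by the hits that carry it. This is a generic bit-score-weighted voting scheme, not the new scoring function the abstract refers to, which is not described here. The function name `transfer_go_terms`, the input layout, and the example accessions are all hypothetical.

```python
from collections import defaultdict


def transfer_go_terms(hits, annotations):
    """Score GO terms for a query from its homologous hits.

    hits:        list of (subject_id, bitscore) pairs from any search tool
                 (hypothetical layout, e.g. parsed from tabular output).
    annotations: dict mapping subject_id -> set of GO term IDs.

    Returns a dict GO term -> score in [0, 1], where the score is the
    bit-score of annotated hits carrying the term divided by the
    bit-score of all annotated hits. Unannotated hits are ignored.
    """
    total = 0.0
    per_term = defaultdict(float)
    for subject_id, bitscore in hits:
        terms = annotations.get(subject_id)
        if not terms:
            continue  # hit has no known GO terms, so it cannot vote
        total += bitscore
        for term in terms:
            per_term[term] += bitscore
    if total == 0:
        return {}
    return {term: score / total for term, score in per_term.items()}


# Toy example with made-up accessions and bit-scores.
example_hits = [("P1", 200.0), ("P2", 100.0), ("P3", 50.0)]
example_annotations = {
    "P1": {"GO:0003677", "GO:0005634"},
    "P2": {"GO:0003677"},
}
example_scores = transfer_go_terms(example_hits, example_annotations)
```

In this toy case both annotated hits carry `GO:0003677`, so it receives the maximum score, while `GO:0005634` is supported only by the stronger hit and receives a proportionally lower one. Which hits enter `hits` at all depends on the search tool and its parameters, which is the dependence the abstract highlights.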
Pages: 12
Related papers
50 in total
  • [31] NetGO 2.0: improving large-scale protein function prediction with massive sequence, text, domain, family and network information
    Yao, Shuwei
    You, Ronghui
    Wang, Shaojun
    Xiong, Yi
    Huang, Xiaodi
    Zhu, Shanfeng
    NUCLEIC ACIDS RESEARCH, 2021, 49 (W1) : W469 - W475
  • [32] NetGO: improving large-scale protein function prediction with massive network information
    You, Ronghui
    Yao, Shuwei
    Xiong, Yi
    Huang, Xiaodi
    Sun, Fengzhu
    Mamitsuka, Hiroshi
    Zhu, Shanfeng
    NUCLEIC ACIDS RESEARCH, 2019, 47 (W1) : W379 - W387
  • [33] DeepGraphGO: graph neural network for large-scale, multispecies protein function prediction
    You, Ronghui
    Yao, Shuwei
    Mamitsuka, Hiroshi
    Zhu, Shanfeng
    BIOINFORMATICS, 2021, 37 : I262 - I271
  • [34] Tools for Interpreting Large-scale Protein Profiling in Microbiology
    Hendrickson, E. L.
    Lamont, R. J.
    Hackett, M.
    JOURNAL OF DENTAL RESEARCH, 2008, 87 (11) : 1004 - 1015
  • [35] Large-scale model quality assessment for improving protein tertiary structure prediction
    Cao, Renzhi
    Bhattacharya, Debswapna
    Adhikari, Badri
    Li, Jilong
    Cheng, Jianlin
    BIOINFORMATICS, 2015, 31 (12) : 116 - 123
  • [36] A resource database for protein kinase substrate sequence-preference motifs based on large-scale mass spectrometry data
    Poll, Brian G.
    Leo, Kirby T.
    Deshpande, Venky
    Jayatissa, Nipun
    Pisitkun, Trairak
    Park, Euijung
    Yang, Chin-Rang
    Raghuram, Viswanathan
    Knepper, Mark A.
    CELL COMMUNICATION AND SIGNALING, 2024, 22 (01)
  • [38] Using homology relations within a database markedly boosts protein sequence similarity search
    Tong, Jing
    Sadreyev, Ruslan I.
    Pei, Jimin
    Kinch, Lisa N.
    Grishin, Nick V.
    PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2015, 112 (22) : 7003 - 7008
  • [39] Recommendation Systems and Their Preference Prediction Algorithms in a Large-Scale Database
    Takimoto, Seiji
    Hirose, Hideo
    INFORMATION-AN INTERNATIONAL INTERDISCIPLINARY JOURNAL, 2009, 12 (05) : 1165 - 1182
  • [40] Large-Scale Prediction of Human Protein-Protein Interactions from Amino Acid Sequence Based on Latent Topic Features
    Pan, Xiao-Yong
    Zhang, Ya-Nan
    Shen, Hong-Bin
    JOURNAL OF PROTEOME RESEARCH, 2010, 9 (10) : 4992 - 5001