Improving protein function prediction methods with integrated literature data

Cited by: 14
Authors
Gabow, Aaron P. [1]
Leach, Sonia M. [1, 3]
Baumgartner, William A. [1]
Hunter, Lawrence E. [1, 2]
Goldberg, Debra S. [1, 2]
Affiliations
[1] Univ Colorado, Dept Pharmacol, Denver Hlth Sci Ctr, Aurora, CO 80045 USA
[2] Univ Colorado, Dept Comp Sci, Boulder, CO 80309 USA
[3] Katholieke Univ Leuven, Dept Elect Engn ESAT, Res Div SCD, B-3001 Louvain, Belgium
DOI
10.1186/1471-2105-9-198
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Determining the function of uncharacterized proteins is a major challenge in the post-genomic era due to the problem's complexity and scale. Identifying a protein's function contributes to an understanding of its role in the involved pathways, its suitability as a drug target, and its potential for protein modifications. Several graph-theoretic approaches predict unidentified functions of proteins by using the functional annotations of better-characterized proteins in protein-protein interaction networks. We systematically consider the use of literature co-occurrence data, introduce a new method for quantifying the reliability of co-occurrence, and test how performance differs across species. We also quantify changes in performance as the prediction algorithms annotate with increased specificity.

Results: We find that including information on the co-occurrence of proteins within an abstract greatly boosts performance in the Functional Flow graph-theoretic function prediction algorithm in yeast, fly and worm. This increase in performance is not simply due to the presence of additional edges, since supplementing protein-protein interactions with co-occurrence data outperforms supplementing with a comparably sized genetic interaction dataset. Through the combination of protein-protein interactions and co-occurrence data, the neighborhood around unknown proteins is quickly connected to well-characterized nodes, which global prediction algorithms can exploit. Our method for quantifying co-occurrence reliability shows superior performance to the other methods, particularly at threshold values around 10%, which yield the best trade-off between coverage and accuracy. In contrast, the traditional way of asserting co-occurrence when at least one abstract mentions both proteins proves to be the worst method for generating co-occurrence data, introducing too many false positives. Annotating the functions with greater specificity is harder, but co-occurrence data still proves beneficial.

Conclusion: Co-occurrence data is a valuable supplemental source for graph-theoretic function prediction algorithms. A rapidly growing literature corpus ensures that co-occurrence data is a readily available resource for nearly every studied organism, particularly those with small protein interaction databases. Though arguably biased toward known genes, co-occurrence data provides critical additional links to well-studied regions in the interaction network that graph-theoretic function prediction algorithms can exploit.
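The contrast drawn in the abstract between asserting co-occurrence from a single shared abstract and filtering pairs by a reliability score can be illustrated with a small sketch. Everything here is an assumption made for illustration: the abstract does not give the authors' reliability measure, so a Jaccard-style ratio stands in for it, and a weighted neighbor vote stands in for Functional Flow, which is a different, flow-based algorithm. The function names `cooccurrence_edges` and `neighbor_vote` and the toy protein labels are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_edges(abstracts, min_score=0.0):
    """Score protein pairs by shared abstracts / abstracts mentioning either.

    ASSUMPTION: this Jaccard-style score is a stand-in, not the paper's metric.
    With min_score=0.0 every pair sharing at least one abstract becomes an edge,
    which mirrors the traditional "at least one abstract" rule.
    """
    mentions = Counter()   # protein -> number of abstracts mentioning it
    together = Counter()   # (a, b) -> number of abstracts mentioning both
    for proteins in abstracts:
        unique = sorted(set(proteins))
        mentions.update(unique)
        together.update(combinations(unique, 2))

    edges = {}
    for (a, b), n_ab in together.items():
        score = n_ab / (mentions[a] + mentions[b] - n_ab)
        if score > min_score:
            edges[(a, b)] = score
    return edges


def neighbor_vote(edges, annotations, protein):
    """Pick the function with the largest edge-weighted vote among neighbors.

    ASSUMPTION: a simple local vote, used only to show how extra edges to
    annotated proteins can inform a prediction; it is not Functional Flow.
    """
    votes = defaultdict(float)
    for (a, b), weight in edges.items():
        if protein not in (a, b):
            continue
        other = b if a == protein else a
        for function in annotations.get(other, ()):
            votes[function] += weight
    return max(votes, key=votes.get) if votes else None


# Toy corpus: each entry lists the proteins mentioned in one abstract.
abstracts = [["A", "B"], ["A", "B"], ["A", "C"], ["C", "D"]]
all_pairs = cooccurrence_edges(abstracts)                  # permissive rule
filtered = cooccurrence_edges(abstracts, min_score=0.3)    # drops the weak A-C pair
prediction = neighbor_vote(all_pairs, {"B": ["kinase"], "C": ["transport"]}, "A")
```

In this toy corpus the permissive rule keeps all three pairs, while the stricter cutoff discards the pair supported by only one of four relevant abstracts. In practice the co-occurrence edges would be merged with protein-protein interaction edges before prediction, as the abstract describes.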
Pages: 16