A Comprehensive Benchmark of Kernel Methods to Extract Protein-Protein Interactions from Literature

Cited by: 98
Authors
Tikk, Domonkos [1, 2]
Thomas, Philippe [1]
Palaga, Peter [1]
Hakenberg, Joerg [3]
Leser, Ulf [1]
Affiliations
[1] Humboldt Univ, Dept Comp Sci, Berlin, Germany
[2] Budapest Univ Technol & Econ, Dept Telecommun & Media Informat, H-1117 Budapest, Hungary
[3] Arizona State Univ, Dept Comp Sci & Engn, Tempe, AZ 85287 USA
Keywords
NATURAL-LANGUAGE PARSERS; INFORMATION EXTRACTION; COMPLEXES; NETWORK; CORPUS; TEXT
DOI
10.1371/journal.pcbi.1000837
Chinese Library Classification number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
The most important way of conveying new findings in biomedical research is scientific publication. Extraction of protein-protein interactions (PPIs) reported in scientific publications is one of the core topics of text mining in the life sciences. Recently, a new class of such methods has been proposed - convolution kernels that identify PPIs using deep parses of sentences. However, comparing published results of different PPI extraction methods is impossible due to the use of different evaluation corpora, different evaluation metrics, different tuning procedures, etc. In this paper, we study whether the reported performance metrics are robust across different corpora and learning settings and whether the use of deep parsing actually leads to an increase in extraction quality. Our ultimate goal is to identify the one method that performs best in real-life scenarios, where information extraction is performed on unseen text and not on specifically prepared evaluation data. We performed a comprehensive benchmarking of nine different methods for PPI extraction that use convolution kernels on rich linguistic information. Methods were evaluated on five different public corpora using cross-validation, cross-learning, and cross-corpus evaluation. Our study confirms that kernels using dependency trees generally outperform kernels based on syntax trees. However, our study also shows that only the best kernel methods can compete with a simple rule-based approach when the evaluation prevents information leakage between training and test corpora. Our results further reveal that the F-score of many approaches drops significantly if no corpus-specific parameter optimization is applied and that methods reaching a good AUC score often perform much worse in terms of F-score. We conclude that for most kernels no sensible estimation of PPI extraction performance on new text is possible, given the current heterogeneity in evaluation data. 
Nevertheless, our study shows that three kernels are clearly superior to the other methods.
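The abstract's remark that a good AUC score can coexist with a much worse F-score can be illustrated with a toy sketch. This is not the authors' evaluation code: the labels, scores, thresholds, and function names below are made up for illustration. It only shows that AUC measures ranking quality independently of any decision threshold, while the F-score depends on where the threshold is placed.

```python
def f1_at_threshold(labels, scores, threshold):
    """F-score of the positive class when predicting positive iff score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    if tp == 0:
        return 0.0  # no true positives: F-score is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = the protein pair interacts, 0 = it does not.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.40, 0.35, 0.30, 0.20, 0.15, 0.10, 0.05, 0.01]

ranking_quality = auc(labels, scores)               # ranking is perfect
f1_default = f1_at_threshold(labels, scores, 0.5)   # untuned threshold
f1_tuned = f1_at_threshold(labels, scores, 0.25)    # threshold tuned to this data
print(ranking_quality, f1_default, f1_tuned)
```

In this toy case the classifier ranks every positive above every negative, yet an untuned threshold of 0.5 labels nothing as positive. This is one way an F-score can collapse without corpus-specific parameter optimization even when the AUC is high.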
Pages: 19