Efficient Extraction of Protein-Protein Interactions from Full-Text Articles

Cited by: 22
Authors
Hakenberg, Joerg [1]
Leaman, Robert [2]
Vo, Nguyen Ha [1]
Jonnalagadda, Siddhartha [2]
Sullivan, Ryan [2]
Miller, Christopher [2]
Tari, Luis [3]
Baral, Chitta [1]
Gonzalez, Graciela [2]
Affiliations
[1] Arizona State Univ, Dept Comp Sci, Tempe, AZ 85281 USA
[2] Arizona State Univ, Dept Biomed Informat, Phoenix, AZ 85004 USA
[3] Hoffmann La Roche Inc, Nutley, NJ 07110 USA
Funding
US National Science Foundation;
Keywords
Biology and genetics; text analysis; bioinformatics (genome or protein) databases; NORMALIZATION;
DOI
10.1109/TCBB.2010.51
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Proteins and their interactions govern virtually all cellular processes, such as regulation, signaling, metabolism, and structure. Most experimental findings pertaining to such interactions are discussed in research papers, which, in turn, get curated by protein interaction databases. Authors, editors, and publishers benefit from efforts to alleviate the tasks of searching for relevant papers, evidence for physical interactions, and proper identifiers for each protein involved. The BioCreative II.5 community challenge addressed these tasks in a competition-style assessment to evaluate and compare different methodologies, to make aware of the increasing accuracy of automated methods, and to guide future implementations. In this paper, we present our approaches for protein-named entity recognition, including normalization, and for extraction of protein-protein interactions from full text. Our overall goal is to identify efficient individual components, and we compare various compositions to handle a single full-text article in between 10 seconds and 2 minutes. We propose strategies to transfer document-level annotations to the sentence-level, which allows for the creation of a more fine-grained training corpus; we use this corpus to automatically derive around 5,000 patterns. We rank sentences by relevance to the task of finding novel interactions with physical evidence, using a sentence classifier built from this training corpus. Heuristics for paraphrasing sentences help to further remove unnecessary information that might interfere with patterns, such as additional adjectives, clauses, or bracketed expressions. In BioCreative II.5, we achieved an f-score of 22 percent for finding protein interactions, and 43 percent for mapping proteins to UniProt IDs; disregarding species, f-scores are 30 percent and 55 percent, respectively. On average, our best-performing setup required around 2 minutes per full text. 
All data and pattern sets, as well as Java classes that extend third-party software, are available as supplementary information (see Appendix).
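The abstract mentions transferring document-level annotations to the sentence level to build a finer-grained training corpus. The record does not say how this is done, so the following is only a minimal sketch under a simple assumption: a sentence is treated as evidence for an annotated pair if it mentions both protein names. The function names, the naive sentence splitter, and the sample text are hypothetical and are not taken from the paper.

```python
import re
from typing import List, Tuple

Pair = Tuple[str, str]

def split_sentences(text: str) -> List[str]:
    # Naive splitter (assumption): break after ., !, or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def transfer_annotations(text: str, pairs: List[Pair]) -> List[Tuple[str, Pair]]:
    """Return (sentence, pair) for every sentence that mentions both
    names of a document-level interaction pair."""
    labeled = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        for a, b in pairs:
            # Plain case-insensitive substring match; a real system would
            # use a named-entity recognizer and normalized identifiers.
            if a.lower() in lowered and b.lower() in lowered:
                labeled.append((sentence, (a, b)))
    return labeled

if __name__ == "__main__":
    doc = ("RAD51 binds BRCA2 in vitro. "
           "BRCA2 was also expressed in yeast. "
           "We cloned TP53.")
    for sentence, pair in transfer_annotations(doc, [("RAD51", "BRCA2")]):
        print(pair, "->", sentence)
```

Sentences selected this way could then serve as positive examples for a sentence classifier or as a source for pattern induction; co-mention alone is a noisy signal, so any such corpus would likely need further filtering.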
Pages: 481-494
Page count: 14