Efficient Extraction of Protein-Protein Interactions from Full-Text Articles

Cited by: 22
Authors
Hakenberg, Joerg [1 ]
Leaman, Robert [2 ]
Vo, Nguyen Ha [1 ]
Jonnalagadda, Siddhartha [2 ]
Sullivan, Ryan [2 ]
Miller, Christopher [2 ]
Tari, Luis [3 ]
Baral, Chitta [1 ]
Gonzalez, Graciela [2 ]
Affiliations
[1] Arizona State Univ, Dept Comp Sci, Tempe, AZ 85281 USA
[2] Arizona State Univ, Dept Biomed Informat, Phoenix, AZ 85004 USA
[3] Hoffmann La Roche Inc, Nutley, NJ 07110 USA
Funding
National Science Foundation (USA)
Keywords
Biology and genetics; text analysis; bioinformatics (genome or protein) databases; normalization
DOI
10.1109/TCBB.2010.51
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Proteins and their interactions govern virtually all cellular processes, such as regulation, signaling, metabolism, and structure. Most experimental findings pertaining to such interactions are discussed in research papers, which, in turn, get curated by protein interaction databases. Authors, editors, and publishers benefit from efforts to alleviate the tasks of searching for relevant papers, evidence for physical interactions, and proper identifiers for each protein involved. The BioCreative II.5 community challenge addressed these tasks in a competition-style assessment to evaluate and compare different methodologies, to raise awareness of the increasing accuracy of automated methods, and to guide future implementations. In this paper, we present our approaches for protein named entity recognition, including normalization, and for extraction of protein-protein interactions from full text. Our overall goal is to identify efficient individual components, and we compare various compositions that handle a single full-text article in 10 seconds to 2 minutes. We propose strategies to transfer document-level annotations to the sentence level, which allows for the creation of a more fine-grained training corpus; we use this corpus to automatically derive around 5,000 patterns. We rank sentences by relevance to the task of finding novel interactions with physical evidence, using a sentence classifier built from this training corpus. Heuristics for paraphrasing sentences help to further remove unnecessary information that might interfere with patterns, such as additional adjectives, clauses, or bracketed expressions. In BioCreative II.5, we achieved an f-score of 22 percent for finding protein interactions, and 43 percent for mapping proteins to UniProt IDs; disregarding species, f-scores are 30 percent and 55 percent, respectively. On average, our best-performing setup required around 2 minutes per full text.
All data and pattern sets, as well as Java classes that extend third-party software, are available as supplementary information (see Appendix).
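As an illustration of the sentence-paraphrasing heuristic mentioned in the abstract, the following is a minimal sketch of one such step: dropping bracketed expressions before patterns are applied. It is not the authors' implementation (their supplementary code is in Java and is not reproduced here); the function name `strip_bracketed`, the regular expression, and the example sentence are assumptions made only for this sketch.

```python
import re

# Hypothetical helper, not taken from the paper's supplementary code.
# Matches one innermost bracketed expression, with any whitespace before it.
# Assumption: round and square brackets are treated interchangeably.
_BRACKETED = re.compile(r"\s*[\(\[][^()\[\]]*[\)\]]")


def strip_bracketed(sentence: str) -> str:
    """Remove bracketed expressions, innermost first, until none remain."""
    previous = None
    while previous != sentence:
        previous = sentence
        sentence = _BRACKETED.sub("", sentence)
    # Collapse any double spaces left behind by the removals.
    return re.sub(r"\s{2,}", " ", sentence).strip()


if __name__ == "__main__":
    # Made-up example sentence, for illustration only.
    print(strip_bracketed("RAD51 (also known as RECA) binds BRCA2 [12]."))
```

Removing such spans shortens the distance between the two protein mentions and the interaction verb, which is the kind of simplification the abstract describes as reducing interference with patterns. Removing adjectives or clauses would need part-of-speech or parse information and is outside this sketch.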
Pages: 481-494
Page count: 14