Mining protein phosphorylation information from biomedical literature using NLP parsing and Support Vector Machines

Cited by: 1
Authors
Raja, Kalpana [1, 2]
Natarajan, Jeyakumar [1]
Affiliations
[1] Bharathiar Univ, Sch Life Sci, Dept Bioinformat, Data Min & Text Min Lab, Coimbatore 641046, Tamil Nadu, India
[2] Univ Michigan, Sch Med, Dept Dermatol, Ann Arbor, MI USA
Keywords
Human protein phosphorylation; hPP corpus; Support Vector Machines; Natural language processing; Information extraction; Post transcriptional modification; EXTRACTION; DATABASE; SYSTEM;
DOI
10.1016/j.cmpb.2018.03.022
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
Background: Extraction of protein phosphorylation information from biomedical literature has gained much attention because of its importance in numerous biological processes. Objective: In this study, we propose a text mining methodology consisting of two phases, NLP parsing and SVM classification, to extract phosphorylation information from the literature. Methods: First, using NLP parsing, we divide the data into three base-forms depending on the biomedical entities related to phosphorylation, and further classify them into ten sub-forms based on their distribution with the phosphorylation keyword. Next, we extract the phosphorylation entity singles/pairs/triplets and apply an SVM to classify them using a set of features applicable to each sub-form. Results: The performance of our methodology was evaluated on three corpora, namely PLC, iProLink and the hPP corpus. We obtained promising results of >85% F-score on the ten sub-forms of the training datasets in cross-validation tests. Our system achieved an overall F-score of 93.0% on the iProLink and 96.3% on the hPP corpus test datasets. Furthermore, our proposed system achieved the best performance in cross-corpus evaluation and outperformed the existing system with a recall of 90.1%. Conclusions: The performance analysis of our unique system on three corpora reveals that it extracts protein phosphorylation information efficiently both in non-organism-specific general datasets such as PLC and iProLink and in a human-specific dataset such as the hPP corpus. (C) 2018 Elsevier B.V. All rights reserved.
Pages: 57-64
Number of pages: 8
Related papers
50 in total
  • [41] Identification of "comment-on sentences" in online biomedical documents using support vector machines
    Kim, In Cheol
    Le, Daniel X.
    Thoma, George R.
    DOCUMENT RECOGNITION AND RETRIEVAL XIV, 2007, 6500
  • [42] Extracting and mining protein-protein interaction network from biomedical literature
    Hu, XH
    Yoo, IH
    Song, IY
    Song, M
    Han, JC
    Lechner, M
    PROCEEDINGS OF THE 2004 IEEE SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY, 2004, : 244 - 251
  • [43] A comparison of conditional random fields and structured support vector machines for chemical entity recognition in biomedical literature
    Buzhou Tang
    Yudong Feng
    Xiaolong Wang
    Yonghui Wu
    Yaoyun Zhang
    Min Jiang
    Jingqi Wang
    Hua Xu
    Journal of Cheminformatics, 7
  • [44] A comparison of conditional random fields and structured support vector machines for chemical entity recognition in biomedical literature
    Tang, Buzhou
    Feng, Yudong
    Wang, Xiaolong
    Wu, Yonghui
    Zhang, Yaoyun
    Jiang, Min
    Wang, Jingqi
    Xu, Hua
    JOURNAL OF CHEMINFORMATICS, 2015, 7
  • [45] Protein Secondary Structure Prediction Using Support Vector Machines (SVMs)
    Patel, Mayuri
    Shah, Hitesh
    2013 INTERNATIONAL CONFERENCE ON MACHINE INTELLIGENCE AND RESEARCH ADVANCEMENT (ICMIRA 2013), 2013, : 594 - 598
  • [46] Predicting Protein Subcellular Localization using PsePSSM and Support Vector Machines
    Juan, Eric Y. T.
    Jhang, J. H.
    Li, W. J.
    PROCEEDINGS OF THE 11TH JOINT CONFERENCE ON INFORMATION SCIENCES, 2008,
  • [47] Protein fold recognition using neural networks and support vector machines
    Jiang, N
    Wu, WXY
    Mitchell, I
    INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING IDEAL 2005, PROCEEDINGS, 2005, 3578 : 462 - 469
  • [48] Human papillomavirus risk type classification from protein sequences using support vector machines
    Kim, Sun
    Zhang, Young-Tak
    APPLICATIONS OF EVOLUTIONARY COMPUTING, PROCEEDINGS, 2006, 3907 : 57 - 66
  • [49] Mining protein interaction from biomedical literature with relation kernel method
    Eom, Jae-Hong
    Zhang, Byoung Tak
    ADVANCES IN NEURAL NETWORKS - ISNN 2006, PT 3, PROCEEDINGS, 2006, 3973 : 642 - 647
  • [50] Mining Faces from Biomedical Literature using Deep Learning
    Dawson, Mitchell
    Zisserman, Andrew
    Nellaker, Christoffer
    ACM-BCB' 2017: PROCEEDINGS OF THE 8TH ACM INTERNATIONAL CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY, AND HEALTH INFORMATICS, 2017, : 562 - 567