Improving the performance of dictionary-based approaches in protein name recognition

Cited by: 60
Authors
Tsuruoka, Y
Tsujii, J
Affiliations
[1] JST Agcy, CREST, Kawaguchi, Saitama 3320012, Japan
[2] Univ Tokyo, Dept Comp Sci, Bunkyo Ku, Tokyo 1130033, Japan
Keywords
protein name recognition; naive Bayes classifier; approximate string search; spelling variant generator
DOI
10.1016/j.jbi.2004.08.003
Chinese Library Classification number
TP39 [Computer applications];
Subject classification code
081203 ; 0835 ;
Abstract
Dictionary-based protein name recognition is often a first step in extracting information from biomedical documents because it can provide ID information on recognized terms. However, dictionary-based approaches present two fundamental difficulties: (1) false recognition, mainly caused by short names; and (2) low recall due to spelling variations. In this paper, we tackle the former problem using machine learning to filter out false positives and present two alternative methods for alleviating the latter problem of spelling variations. The first is achieved by using approximate string searching, and the second by expanding the dictionary with a probabilistic variant generator, which we propose in this paper. Experimental results using the GENIA corpus revealed that filtering using a naive Bayes classifier greatly improved precision with only a slight loss of recall, resulting in a 10.8% improvement in F-measure, and dictionary expansion with the variant generator gave a further 1.6% improvement and achieved an F-measure of 66.6%. © 2004 Elsevier Inc. All rights reserved.
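The approximate string searching idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy dictionary, the edit-distance threshold `max_dist`, and the span limit `max_len` are all assumptions made for illustration, and the naive Bayes filter and the probabilistic variant generator are omitted. The `f_measure` helper assumes the balanced F-measure (harmonic mean of precision and recall).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def find_candidates(tokens, dictionary, max_dist=1, max_len=3):
    """Return (start, end, entry) for every token span whose lowercased
    text is within max_dist edits of a lowercased dictionary entry.
    max_dist and max_len are illustrative values, not from the paper."""
    hits = []
    for start in range(len(tokens)):
        last = min(start + max_len, len(tokens))
        for end in range(start + 1, last + 1):
            span = " ".join(tokens[start:end]).lower()
            for entry in dictionary:
                if edit_distance(span, entry.lower()) <= max_dist:
                    hits.append((start, end, entry))
    return hits


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy dictionary and sentence, for illustration only.
toy_dictionary = ["NF-kappaB", "IL-2"]
toy_tokens = "the NF-kappa B protein binds IL2".split()
candidates = find_candidates(toy_tokens, toy_dictionary)
```

In this toy run, the spelling variants "NF-kappa B" and "IL2" are both matched to dictionary entries, but the shorter span "NF-kappa" is matched as well. Loose matching raises recall at the cost of spurious candidates of this kind, which is the sort of false positive a downstream filter would have to remove.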
Pages: 461-470
Page count: 10