ChemicalTagger: A tool for semantic text-mining in chemistry

Times cited: 113
Authors
Hawizy, Lezan [1]
Jessop, David M. [1]
Adams, Nico [2]
Murray-Rust, Peter [1]
Affiliations
[1] Univ Cambridge, Dept Chem, Unilever Ctr Mol Sci Informat, Cambridge CB2 1EW, England
[2] European Bioinformat Inst, Cambridge CB10 1SD, England
Source
Journal of Cheminformatics, 3
Keywords
WEB; XML;
DOI
10.1186/1758-2946-3-17
Chinese Library Classification number
O6 [Chemistry];
Discipline classification code
0703;
Abstract
Background: The primary method of scientific communication is the published scientific article or thesis, which uses natural language combined with domain-specific terminology. As such, these documents contain free-flowing, unstructured text. Given the usefulness of data extraction from unstructured literature, we aim to show how this can be achieved for the discipline of chemistry. The highly formulaic style of writing that most chemists adopt makes their contributions well suited to high-throughput Natural Language Processing (NLP) approaches.

Results: We have developed the ChemicalTagger parser as a medium-depth, phrase-based semantic NLP tool for the language of chemical experiments. Tagging is based on a modular architecture and uses a combination of OSCAR, domain-specific regex and English taggers to identify parts of speech. An ANTLR grammar is then used to structure the tagged tokens into tree-based phrases. Using a metric that allows for overlapping annotations, we achieved machine-annotator agreements of 88.9% for phrase recognition and 91.9% for phrase-type identification (Action names).

Conclusions: It is possible to parse chemical experimental text using rule-based techniques in conjunction with a formal grammar parser. ChemicalTagger has been deployed on over 10,000 patents and has identified solvents from their linguistic context with >99.5% precision.
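
The two-stage idea described in the abstract (tag each token, then group the tagged tokens into action phrases) can be illustrated with a minimal Python sketch. This is not ChemicalTagger's code or API: the tag names, the regular expressions, the clause-splitting rule, and the functions `tag_tokens` and `group_phrases` are all assumptions made for illustration, and a simple clause splitter stands in for the ANTLR grammar.

```python
import re

# Hypothetical tag set and patterns, for illustration only.
# They are NOT the rules used by ChemicalTagger or OSCAR.
TAG_PATTERNS = [
    ("QUANTITY", re.compile(r"\d+(\.\d+)?")),
    ("UNIT", re.compile(r"mg|g|mL|L|mmol|mol|h|min")),
    ("VERB-ADD", re.compile(r"add|added", re.IGNORECASE)),
    ("VERB-STIR", re.compile(r"stir|stirred", re.IGNORECASE)),
    ("SOLVENT", re.compile(r"ethanol|methanol|water|toluene", re.IGNORECASE)),
]

# Tokens that close one clause and open the next (an assumed, crude rule).
CLAUSE_BREAKS = {",", "and", "then"}


def tag_tokens(sentence):
    """Stage 1: assign each whitespace token the first tag whose regex fully matches."""
    tokens = sentence.replace(",", " , ").rstrip(".").split()
    tagged = []
    for tok in tokens:
        tag = next(
            (name for name, pat in TAG_PATTERNS if pat.fullmatch(tok)),
            "OTHER",
        )
        tagged.append((tok, tag))
    return tagged


def group_phrases(tagged):
    """Stage 2: split into clauses and label each clause by its first action verb."""
    clauses, current = [], []
    for tok, tag in tagged:
        if tok.lower() in CLAUSE_BREAKS:
            if current:
                clauses.append(current)
            current = []
        else:
            current.append((tok, tag))
    if current:
        clauses.append(current)

    phrases = []
    for clause in clauses:
        verb_tag = next((t for _, t in clause if t.startswith("VERB-")), None)
        action = verb_tag[len("VERB-"):].title() if verb_tag else "Unknown"
        phrases.append({"action": action, "tokens": [tok for tok, _ in clause]})
    return phrases


if __name__ == "__main__":
    example = "Ethanol 5 mL was added and the mixture was stirred for 2 h."
    for phrase in group_phrases(tag_tokens(example)):
        print(phrase["action"], phrase["tokens"])
```

A real grammar-based parser would build nested phrase trees rather than flat clauses; the sketch only shows why formulaic experimental prose lends itself to rule-based tagging followed by structural grouping.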
Pages: 13