A document processing pipeline for annotating chemical entities in scientific documents

Cited by: 14
Authors
Campos, David [1]
Matos, Sergio [2]
Oliveira, Jose L. [2]
Affiliations
[1] BMD Software Lda, Rua Calouste Gulbenkian 1, P-3810074 Aveiro, Portugal
[2] Univ Aveiro, DETI IEETA, P-3810193 Aveiro, Portugal
Keywords
DISCOVERY; DATABASE; DRUGS;
DOI
10.1186/1758-2946-7-S1-S7
Chinese Library Classification
O6 [Chemistry]
Subject classification code
0703
Abstract
Background: The recognition of drugs and chemical entities in text is a very important task within the field of biomedical information extraction, given the rapid growth in the amount of published texts (scientific papers, patents, patient records) and the relevance of these and other related concepts. If done effectively, this could allow exploiting such textual resources to automatically extract or infer relevant information, such as drug profiles, relations and similarities between drugs, or associations between drugs and potential drug targets. The objective of this work was to develop and validate a document processing and information extraction pipeline for the identification of chemical entity mentions in text.

Results: We used the BioCreative IV CHEMDNER task data to train and evaluate a machine-learning based entity recognition system. Using a combination of two conditional random field models, a selected set of features, and a post-processing stage, we achieved F-measure results of 87.48% in the chemical entity mention recognition task and 87.75% in the chemical document indexing task.

Conclusions: We present a machine learning-based solution for automatic recognition of chemical and drug names in scientific documents. The proposed approach applies a rich feature set, including linguistic, orthographic, morphological, dictionary matching and local context features. Post-processing modules are also integrated, performing parentheses correction, abbreviation resolution and filtering erroneous mentions using an exclusion list derived from the training data. The developed methods were implemented as a document annotation tool and web service, freely available at http://bioinformatics.ua.pt/becas-chemicals/.
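
The post-processing stage named in the abstract (parentheses correction and exclusion-list filtering) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `is_balanced`, `fix_parentheses` and `post_process`, the one-character span extension rule, and the example exclusion list are all assumptions made for illustration. Abbreviation resolution is omitted.

```python
def is_balanced(text: str) -> bool:
    """Return True if every '(' in text has a matching ')' in order."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def fix_parentheses(sentence: str, start: int, end: int):
    """Hypothetical correction rule: if a mention has unbalanced parentheses,
    try extending it by one character on either side; return None if that
    does not balance it."""
    if is_balanced(sentence[start:end]):
        return start, end
    # Try to absorb a closing parenthesis immediately to the right.
    if end < len(sentence) and sentence[end] == ")" \
            and is_balanced(sentence[start:end + 1]):
        return start, end + 1
    # Try to absorb an opening parenthesis immediately to the left.
    if start > 0 and sentence[start - 1] == "(" \
            and is_balanced(sentence[start - 1:end]):
        return start - 1, end
    return None


def post_process(sentence: str, spans, exclusion):
    """Correct parentheses, then drop mentions found in the exclusion list.
    `spans` are (start, end) character offsets predicted by some tagger."""
    kept = []
    for start, end in spans:
        fixed = fix_parentheses(sentence, start, end)
        if fixed is None:
            continue  # could not be balanced: treat as erroneous
        s, e = fixed
        if sentence[s:e].lower() in exclusion:
            continue  # assumed exclusion list, e.g. built from training errors
        kept.append((s, e))
    return kept
```

As a usage example, given the sentence "Treatment with poly(ethylene glycol) and water reduced toxicity.", a predicted span that stops just before the closing parenthesis would be extended to cover "poly(ethylene glycol)", and a span covering "water" would be discarded if "water" is in the exclusion list.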
Pages: 10