A document processing pipeline for annotating chemical entities in scientific documents

Cited by: 14
Authors
Campos, David [1]
Matos, Sergio [2]
Oliveira, Jose L. [2]
Affiliations
[1] BMD Software Lda, Rua Calouste Gulbenkian 1, P-3810074 Aveiro, Portugal
[2] Univ Aveiro, DETI IEETA, P-3810193 Aveiro, Portugal
Source
Journal of Cheminformatics, 7
Keywords
DISCOVERY; DATABASE; DRUGS
DOI
10.1186/1758-2946-7-S1-S7
Chinese Library Classification
O6 [Chemistry]
Discipline classification code
0703
Abstract
Background: The recognition of drugs and chemical entities in text is a very important task within the field of biomedical information extraction, given the rapid growth in the amount of published texts (scientific papers, patents, patient records) and the relevance of these and other related concepts. If done effectively, this could allow exploiting such textual resources to automatically extract or infer relevant information, such as drug profiles, relations and similarities between drugs, or associations between drugs and potential drug targets. The objective of this work was to develop and validate a document processing and information extraction pipeline for the identification of chemical entity mentions in text.

Results: We used the BioCreative IV CHEMDNER task data to train and evaluate a machine learning-based entity recognition system. Using a combination of two conditional random field models, a selected set of features, and a post-processing stage, we achieved F-measure results of 87.48% in the chemical entity mention recognition task and 87.75% in the chemical document indexing task.

Conclusions: We present a machine learning-based solution for automatic recognition of chemical and drug names in scientific documents. The proposed approach applies a rich feature set, including linguistic, orthographic, morphological, dictionary matching and local context features. Post-processing modules are also integrated, performing parentheses correction, abbreviation resolution and filtering erroneous mentions using an exclusion list derived from the training data. The developed methods were implemented as a document annotation tool and web service, freely available at http://bioinformatics.ua.pt/becas-chemicals/.
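
The post-processing stage mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Mention` tuple layout and the names `has_balanced_brackets` and `post_process` are hypothetical. As a simplification, a mention with unbalanced brackets is dropped here, whereas the abstract describes the brackets being corrected. Abbreviation resolution is left out.

```python
from typing import List, Set, Tuple

# Hypothetical mention layout: (start offset, end offset, surface text).
Mention = Tuple[int, int, str]

_CLOSERS = {")": "(", "]": "[", "}": "{"}
_OPENERS = set(_CLOSERS.values())


def has_balanced_brackets(text: str) -> bool:
    """Return True if every bracket in the text is opened and closed in order."""
    stack = []
    for ch in text:
        if ch in _OPENERS:
            stack.append(ch)
        elif ch in _CLOSERS:
            # A closer must match the most recent unmatched opener.
            if not stack or stack.pop() != _CLOSERS[ch]:
                return False
    return not stack


def post_process(mentions: List[Mention], exclusion: Set[str]) -> List[Mention]:
    """Filter candidate mentions produced by an upstream recognizer.

    Assumption: mentions with unbalanced brackets are discarded instead of
    repaired, and the exclusion list holds lowercase strings.
    """
    kept = []
    for start, end, text in mentions:
        if not has_balanced_brackets(text):
            continue
        if text.lower() in exclusion:
            continue
        kept.append((start, end, text))
    return kept


if __name__ == "__main__":
    candidates = [(0, 7, "aspirin"), (10, 15, "(NaCl"), (20, 25, "Water")]
    print(post_process(candidates, exclusion={"water"}))
```

In this sketch the exclusion set stands in for the list the abstract says was derived from the training data. How that list was built is not described in this record.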
Pages: 10
Related papers
  • [1] A document processing pipeline for annotating chemical entities in scientific documents
    David Campos
    Sérgio Matos
    José L Oliveira
    [J]. Journal of Cheminformatics, 7
  • [2] Identification of Chemical Entities in Patent Documents
    Grego, Tiago
    Pezik, Piotr
    Couto, Francisco M.
    Rebholz-Schuhmann, Dietrich
    [J]. DISTRIBUTED COMPUTING, ARTIFICIAL INTELLIGENCE, BIOINFORMATICS, SOFT COMPUTING, AND AMBIENT ASSISTED LIVING, PT II, PROCEEDINGS, 2009, 5518 : 942 - +
  • [3] Extraction and Evaluation of Knowledge Entities from Scientific Documents
    Zhang, Chengzhi
    Mayr, Philipp
    Lu, Wei
    Zhang, Yi
    [J]. JOURNAL OF DATA AND INFORMATION SCIENCE, 2021, 6 (03) : 1 - 5
  • [4] Extraction and Evaluation of Knowledge Entities from Scientific Documents
    Chengzhi Zhang
    Philipp Mayr
    Wei Lu
    Yi Zhang
    [J]. Journal of Data and Information Science, 2021, 6 (03) : 1 - 5
  • [5] Extraction and Evaluation of Knowledge Entities from Scientific Documents
    Chengzhi Zhang
    Philipp Mayr
    Wei Lu
    Yi Zhang
    [J]. Journal of Data and Information Science, 2021, 6 (03) : 1 - 5
  • [6] AUTOMATIC SYSTEM FOR THESAURUS-AIDED ANNOTATING OF SCIENTIFIC AND TECHNICAL DOCUMENTS
    ARZIKULOV, KA
    PIOTROVSKIJ, RG
    POPESKU, AN
    KHAZHINSKAYA, MS
    [J]. NAUCHNO-TEKHNICHESKAYA INFORMATSIYA SERIYA 2-INFORMATSIONNYE PROTSESSY I SISTEMY, 1978, (12) : 12 - 20
  • [7] Extracting discourse elements and annotating scientific documents using the SciAnnotDoc model: a use case in gender documents
    de Ribaupierre, Helene
    Falquet, Gilles
    [J]. INTERNATIONAL JOURNAL ON DIGITAL LIBRARIES, 2018, 19 (2-3) : 271 - 286
  • [8] Scientific processing pipeline for ASPIICS coronagraph
    Shestov, Sergei
    Bourgoinie, Bram
    Nicula, Bogdan
    Dolla, Laurent
    Jean, Christopje
    Verstringe, Freek
    Katsiyannis, Athanassios C.
    Inhester, Bernd
    Maia, Dalmiro
    Ribeiro, Bruno
    Zhukov, Andrei
    [J]. SPACE TELESCOPES AND INSTRUMENTATION 2020: OPTICAL, INFRARED, AND MILLIMETER WAVE, 2021, 11443
  • [9] Comparing manual and automated extraction of chemical entities from documents
    Christian Tyrchan
    Sorel Muresan
    [J]. Journal of Cheminformatics, 2 (Suppl 1)
  • [10] GarNLP: A Natural Language Processing Pipeline for Garnishment Documents
    Bordino, Ilaria
    Ferretti, Andrea
    Gullo, Francesco
    Pascolutti, Stefano
    [J]. INFORMATION SYSTEMS FRONTIERS, 2021, 23 (01) : 101 - 114