Chemical identification and indexing in PubMed full-text articles using deep learning and heuristics

Cited by: 5
Authors
Almeida, Tiago [1]
Antunes, Rui [1]
Silva, Joao F. [1]
Almeida, Joao R. [1, 2]
Matos, Sergio [1]
Affiliations
[1] Univ Aveiro, Inst Elect & Informat Engn Aveiro IEETA, Dept Elect Telecommun & Informat DETI, Aveiro, Portugal
[2] Univ A Coruna, Dept Informat & Commun Technol, La Coruna, Spain
Keywords
NAMED ENTITY RECOGNITION; BIOMEDICAL TEXT; NORMALIZATION; INFORMATION; EXTRACTION; CHALLENGES;
DOI
10.1093/database/baac047
Chinese Library Classification number
Q [Biological Sciences]
Subject classification codes
07; 0710; 09
Abstract
The identification of chemicals in articles has attracted considerable interest in the biomedical scientific community, given its importance in drug development research. Most previous research has focused on PubMed abstracts, and further investigation using full-text documents is required because these contain additional valuable information that must be explored. The manual expert task of indexing Medical Subject Headings (MeSH) terms to these articles later helps researchers find the most relevant publications for their ongoing work. The BioCreative VII NLM-Chem track fostered the development of systems for chemical identification and indexing in PubMed full-text articles. Chemical identification consisted of identifying the chemical mentions and linking these to unique MeSH identifiers. This manuscript describes our participating system and the post-challenge improvements we made. We propose a three-stage pipeline that individually performs chemical mention detection, entity normalization and indexing. For chemical identification, we adopted a deep-learning solution that utilizes PubMedBERT contextualized embeddings followed by a multilayer perceptron and a conditional random field tagging layer. For normalization, we use sieve-based dictionary filtering followed by a deep-learning similarity search strategy. Finally, for indexing, we developed rules for identifying the more relevant MeSH codes for each article. During the challenge, our system obtained the best official results in the normalization and indexing tasks despite its lower performance in the chemical mention recognition task. In a post-contest phase, we boosted our results by improving our named entity recognition model with additional techniques. The final system achieved 0.8731, 0.8275 and 0.4849 in the chemical identification, normalization and indexing tasks, respectively. The code to reproduce our experiments and run the pipeline is publicly available.
Pages: 21
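Illustrative sketch (not the authors' code)
The abstract describes normalization as sieve-based dictionary filtering followed by a similarity search. The Python sketch below shows one way such a cascade could be arranged, under loud assumptions: the dictionary, the "TOY:" identifiers, and the function names `simplify` and `normalize` are all hypothetical, and the deep-learning similarity search is replaced by plain string similarity from the standard library's `difflib`. It is a minimal illustration of the sieve idea only.

```python
import difflib
import re

# Hypothetical toy dictionary: surface form -> identifier.
# The "TOY:" codes are placeholders, not real MeSH identifiers.
TOY_DICTIONARY = {
    "aspirin": "TOY:0001",
    "acetylsalicylic acid": "TOY:0001",
    "ethanol": "TOY:0002",
    "glucose": "TOY:0003",
}


def simplify(text):
    """Lowercase the text and collapse punctuation/whitespace runs to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def normalize(mention, dictionary=TOY_DICTIONARY, cutoff=0.8):
    """Return (identifier, sieve_name) for a mention, or (None, "unmatched")."""
    # Sieve 1: exact lookup of the mention as written.
    if mention in dictionary:
        return dictionary[mention], "exact"

    # Sieve 2: lookup after simplifying both the mention and the dictionary keys.
    simplified = {simplify(key): value for key, value in dictionary.items()}
    key = simplify(mention)
    if key in simplified:
        return simplified[key], "simplified"

    # Fallback: string similarity, used here only as a stand-in for the
    # deep-learning similarity search mentioned in the abstract.
    close = difflib.get_close_matches(key, list(simplified), n=1, cutoff=cutoff)
    if close:
        return simplified[close[0]], "similarity"

    return None, "unmatched"
```

Calling `normalize("Acetylsalicylic-Acid")` falls through the exact sieve and is resolved by the simplified sieve. Ordering the sieves from strictest to loosest means the approximate fallback is consulted only for mentions that the cheaper, more precise lookups could not resolve.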