Comparing general and specialized word embeddings for biomedical named entity recognition

Cited by: 3

Authors:

Ramos-Vargas, Rigo E. [1]

Roman-Godinez, Israel [1]

Torres-Ramos, Sulema [1]

Affiliation:

[1] Univ Guadalajara, Dept Ciencias Computac, Guadalajara, Jalisco, Mexico

Source:

PEERJ COMPUTER SCIENCE | 2021 / Vol. 7

Keywords:

Word embeddings; BioNER; BiLSTM-CRF; DrugBank; MedLine; Pyysalo PM + PMC; Glove common crawl; ELMo embeddings; Pooled flair embeddings; Transformer embeddings; RELATEDNESS;

DOI:

10.7717/peerj-cs.384

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Increased interest in the use of word embeddings as word representations for biomedical named entity recognition (BioNER) has highlighted the need for evaluations that aid in selecting the best word embedding to use. One common criterion for selecting a word embedding is the type of source from which it is generated; that is, general (e.g., Wikipedia, Common Crawl) or specific (e.g., biomedical literature). Using specific word embeddings for the BioNER task has been strongly recommended, on the grounds that they provide better coverage and semantic relationships among medical entities. To the best of our knowledge, most studies have focused on improving BioNER task performance by, on the one hand, combining several features extracted from the text (for instance, linguistic, morphological, character embedding, and word embedding itself) and, on the other, testing several state-of-the-art named entity recognition algorithms. These studies, however, pay little attention to the influence of the word embeddings and make it difficult to observe their real impact on the BioNER task. For this reason, the present study evaluates three well-known NER algorithms (CRF, BiLSTM, BiLSTM-CRF) on two corpora (DrugBank and MedLine) using two classic word embeddings, GloVe Common Crawl (of the general type) and Pyysalo PM + PMC (specific), as the only features. Furthermore, three contextualized word embeddings (ELMo, Pooled Flair, and Transformer) are compared in their general and specific versions. The aim is to determine whether general embeddings can perform better than specialized ones on the BioNER task. To this end, four experiments were designed. In the first, we set out to identify the combination of classic word embedding, NER algorithm, and corpus that results in the best performance. The second evaluated the effect of the size of the corpus on performance. The third assessed the semantic cohesiveness of the classic word embeddings and their correlation with several gold standards, while the fourth evaluated the performance of general and specific contextualized word embeddings on the BioNER task. Results show that the classic general word embedding GloVe Common Crawl performed better on the DrugBank corpus, despite having lower word coverage and weaker internal semantic relationships than the classic specific word embedding, Pyysalo PM + PMC; among the contextualized word embeddings, the specific ones gave the best results. We conclude, therefore, that when using classic word embeddings as features on the BioNER task, the general ones could be considered a good option. On the other hand, when using contextualized word embeddings, the specific ones are the best option.
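
The abstract refers to two properties of a classic word embedding: its word coverage of a corpus and its correlation with gold-standard relatedness scores. The sketch below shows one common way such quantities could be computed. It is an illustration under stated assumptions, not the authors' procedure: the function names (`coverage`, `cosine`, `spearman`), the three-dimensional vectors, the token list, and the gold scores are all made-up toy values, and the choice of cosine similarity with Spearman rank correlation is an assumption rather than something the abstract specifies.

```python
import math

def coverage(vocab, tokens):
    """Fraction of distinct corpus tokens that have a vector in the embedding."""
    types = set(tokens)
    return sum(1 for t in types if t in vocab) / len(types)

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Toy embedding and corpus (invented values, for illustration only).
toy_embedding = {
    "aspirin": [0.9, 0.1, 0.0],
    "ibuprofen": [0.8, 0.2, 0.1],
    "insulin": [0.1, 0.9, 0.2],
    "tablet": [0.3, 0.3, 0.9],
}
toy_tokens = ["aspirin", "ibuprofen", "warfarin", "tablet", "aspirin"]

# Toy gold standard: (word1, word2, human relatedness score), also invented.
toy_gold = [
    ("aspirin", "ibuprofen", 3.8),
    ("aspirin", "insulin", 1.5),
    ("insulin", "tablet", 1.0),
]

cov_score = coverage(toy_embedding, toy_tokens)
model_sims = [cosine(toy_embedding[a], toy_embedding[b]) for a, b, _ in toy_gold]
gold_scores = [score for _, _, score in toy_gold]
rho = spearman(model_sims, gold_scores)

print(f"coverage = {cov_score:.2f}, Spearman rho = {rho:.2f}")
```

With real data, the dictionary would be replaced by vectors loaded from the embedding file and the gold list by a published relatedness benchmark; pairs containing out-of-vocabulary words would need to be skipped or handled explicitly.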


Pages: 1 / 22

Page count: 22

