Comparing general and specialized word embeddings for biomedical named entity recognition

Cited by: 3
Authors
Ramos-Vargas, Rigo E. [1]
Roman-Godinez, Israel [1]
Torres-Ramos, Sulema [1]
Affiliation
[1] Univ Guadalajara, Dept Ciencias Computac, Guadalajara, Jalisco, Mexico
Keywords
Word embeddings; BioNER; BiLSTM-CRF; DrugBank; MedLine; Pyysalo PM + PMC; GloVe Common Crawl; ELMo embeddings; Pooled Flair embeddings; Transformer embeddings; Relatedness
DOI
10.7717/peerj-cs.384
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Increased interest in the use of word embeddings as word representations for biomedical named entity recognition (BioNER) has highlighted the need for evaluations that help in selecting the best word embedding to use. One common criterion for selecting a word embedding is the type of source from which it is generated: general (e.g., Wikipedia, Common Crawl) or specific (e.g., biomedical literature). Specific word embeddings have been strongly recommended for the BioNER task because they provide better coverage of medical entities and better capture the semantic relationships among them. To the best of our knowledge, most studies have focused on improving BioNER task performance by, on the one hand, combining several features extracted from the text (for instance, linguistic, morphological, character embedding, and word embedding itself) and, on the other, testing several state-of-the-art named entity recognition algorithms. These studies, however, pay little attention to the influence of the word embeddings and make it difficult to observe their real impact on the BioNER task. For this reason, the present study evaluates three well-known NER algorithms (CRF, BiLSTM, BiLSTM-CRF) on two corpora (DrugBank and MedLine) using two classic word embeddings, GloVe Common Crawl (general) and Pyysalo PM + PMC (specific), as the only features. Furthermore, three contextualized word embeddings (ELMo, Pooled Flair, and Transformer) are compared in their general and specific versions. The aim is to determine whether general embeddings can perform better than specialized ones on the BioNER task. To this end, four experiments were designed. The first identified the combination of classic word embedding, NER algorithm, and corpus that results in the best performance. The second evaluated the effect of corpus size on performance. The third assessed the semantic cohesiveness of the classic word embeddings and their correlation with several gold standards. The fourth evaluated the performance of general and specific contextualized word embeddings on the BioNER task. Results show that the classic general word embedding GloVe Common Crawl performed better on the DrugBank corpus, despite having lower word coverage and weaker internal semantic relationships than the classic specific word embedding, Pyysalo PM + PMC; among the contextualized word embeddings, in contrast, the specific ones gave the best results. We conclude, therefore, that when classic word embeddings are used as features on the BioNER task, the general ones could be considered a good option. On the other hand, when contextualized word embeddings are used, the specific ones are the best option.
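The abstract refers to the "word coverage" of an embedding over a corpus without defining it in this record. As a minimal sketch, assuming coverage means the share of corpus words that have a vector in the embedding vocabulary, it could be measured as below. The function name `coverage`, the lowercasing step, and the toy sentences and vocabularies are hypothetical illustrations, not the authors' code or data.

```python
from collections import Counter


def coverage(corpus_tokens, embedding_vocab, lowercase=True):
    """Return (type_coverage, token_coverage) of a vocabulary over a corpus.

    type_coverage:  fraction of distinct corpus words found in the vocabulary.
    token_coverage: fraction of all corpus word occurrences found in it.
    Lowercasing is an assumption of this sketch, not a detail from the paper.
    """
    norm = (lambda t: t.lower()) if lowercase else (lambda t: t)
    vocab = {norm(w) for w in embedding_vocab}
    counts = Counter(norm(t) for t in corpus_tokens)
    if not counts:
        return 0.0, 0.0
    covered = [t for t in counts if t in vocab]
    type_cov = len(covered) / len(counts)
    token_cov = sum(counts[t] for t in covered) / sum(counts.values())
    return type_cov, token_cov


# Toy data invented for illustration only.
tokens = "Aspirin may increase the effect of warfarin . Aspirin is common .".split()
general_vocab = {"may", "increase", "the", "effect", "of", "is", "common", "."}
specific_vocab = general_vocab | {"aspirin", "warfarin"}

general_cov = coverage(tokens, general_vocab)
specific_cov = coverage(tokens, specific_vocab)
print("general :", general_cov)
print("specific:", specific_cov)
```

In this toy case the specific vocabulary covers the drug names that the general one misses. The abstract's finding is that higher coverage of this kind did not by itself translate into better BioNER performance for the classic embeddings.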
Pages: 1 / 22
Page count: 22