The Impact of Specialized Corpora for Word Embeddings in Natural Langage Understanding

Cited by: 5
Authors
Neuraz, Antoine [1, 2]
Rance, Bastien [1]
Garcelon, Nicolas [1]
Llanos, Leonardo Campillos [2]
Burgun, Anita [1]
Rosset, Sophie [2]
Affiliations
[1] Paris Descartes, UMR 1138, INSERM, Team 22, Paris, France
[2] Univ Paris Saclay, CNRS, LIMSI, Paris, France
Keywords
Natural language processing; Contextual word embeddings; Natural language understanding
DOI
10.3233/SHTI200197
Chinese Library Classification
R19 [Health care organizations and services (health services administration)]
Abstract
Recent studies in the biomedical domain suggest that learning statistical word representations (static or contextualized word embeddings) on large corpora of specialized data improves results on downstream natural language processing (NLP) tasks. In this paper, we explore the impact of the data source of word representations on a natural language understanding task. We compared fastText (static) and ELMo (contextualized) embeddings, each learned either on general-domain data (Wikipedia) or on specialized data (electronic health records, EHR). The best results on both sub-tasks were obtained with ELMo representations learned on EHR data (gains of +7% and +4% in F1-score). Moreover, the ELMo representations were trained with only a fraction of the data used for fastText.
Pages: 432-436
Number of pages: 5