The Impact of Specialized Corpora for Word Embeddings in Natural Language Understanding

Cited: 5
Authors:
Neuraz, Antoine [1 ,2 ]
Rance, Bastien [1 ]
Garcelon, Nicolas [1 ]
Llanos, Leonardo Campillos [2 ]
Burgun, Anita [1 ]
Rosset, Sophie [2 ]
Affiliations:
[1] Paris Descartes, UMR 1138, INSERM, Team 22, Paris, France
[2] Univ Paris Saclay, CNRS, LIMSI, Paris, France
Keywords:
Natural language processing; Contextual word embeddings; Natural language understanding
DOI: 10.3233/SHTI200197
Chinese Library Classification (CLC): R19 [Health organizations and services (health services administration)]
Abstract:
Recent studies in the biomedical domain suggest that learning statistical word representations (static or contextualized word embeddings) on large corpora of specialized data improves the results on downstream natural language processing (NLP) tasks. In this paper, we explore the impact of the data source of word representations on a natural language understanding task. We compared FastText (static) and ELMo (contextualized) representations, learned either on general-domain data (Wikipedia) or on specialized data (electronic health records, EHR). The best results were obtained with ELMo representations learned on EHR data for both sub-tasks (+7% and +4% gains in F1-score). Moreover, the ELMo representations were trained on only a fraction of the data used for FastText.
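One reason FastText-style static embeddings are popular for specialized corpora such as EHR text is their subword mechanism: a word vector is composed from vectors of its character n-grams, so rare or out-of-vocabulary clinical terms still receive representations from shared subwords. The sketch below illustrates that idea only; it is not the paper's code, and the hashing step is a stand-in assumption for FastText's learned n-gram embedding table.

```python
# Illustrative sketch of FastText-style subword n-grams (not the paper's code).
import hashlib

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with FastText's '<' and '>' word-boundary markers."""
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def subword_vector(word, dim=8):
    """Average of per-n-gram pseudo-vectors.

    The MD5 hash here merely stands in for a learned n-gram embedding
    table, so the sketch stays self-contained and deterministic.
    """
    grams = char_ngrams(word)
    acc = [0.0] * dim
    for g in grams:
        digest = hashlib.md5(g.encode("utf-8")).digest()
        for i in range(dim):
            acc[i] += digest[i] / 255.0
    return [x / len(grams) for x in acc]
```

Because "fever" and "feverish" share n-grams such as "fev" and "eve", their composed vectors overlap even if one of the words never appeared in training, which is the property that helps on EHR vocabulary.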
Pages: 432-436 (5 pages)
Related papers (50 items):
  • [1] Asynchronous Training of Word Embeddings for Large Text Corpora
    Anand, Avishek
    Khosla, Megha
    Singh, Jaspreet
    Zab, Jan-Hendrik
    Zhang, Zijian
    PROCEEDINGS OF THE TWELFTH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING (WSDM'19), 2019, : 168 - 176
  • [2] Enriching Word Embeddings with a Regressor Instead of Labeled Corpora
    Abdalla, Mohamed
    Sahlgren, Magnus
    Hirst, Graeme
    THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 6188 - 6195
  • [3] Word Embeddings for the Analysis of Ideological Placement in Parliamentary Corpora
    Rheault, Ludovic
    Cochrane, Christopher
    POLITICAL ANALYSIS, 2020, 28 (01) : 112 - 133
  • [4] Explaining Financial Uncertainty through Specialized Word Embeddings
    Theil, Christoph Kilian
    Štajner, Sanja
    Stuckenschmidt, Heiner
    ACM/IMS Transactions on Data Science, 2020, 1 (01):
  • [5] Word Alignment by Fine-tuning Embeddings on Parallel Corpora
    Dou, Zi-Yi
    Neubig, Graham
    16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2021), 2021, : 2112 - 2128
  • [6] Understanding the Origins of Bias in Word Embeddings
    Brunet, Marc-Etienne
    Alkalay-Houlihan, Colleen
    Anderson, Ashton
    Zemel, Richard
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 97, 2019, 97
  • [7] Temporal Word Embeddings for Narrative Understanding
    Volpetti, Claudia
    Vani, K.
    Antonucci, Alessandro
    ICMLC 2020: 2020 12TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND COMPUTING, 2020, : 68 - 72
  • [8] HistorEx: Exploring Historical Text Corpora Using Word and Document Embeddings
    Mueller, Sven
    Brunzel, Michael
    Kaun, Daniela
    Biswas, Russa
    Koutraki, Maria
    Tietz, Tabea
    Sack, Harald
    SEMANTIC WEB: ESWC 2019 SATELLITE EVENTS, 2019, 11762 : 136 - 140
  • [9] Lexical Comparison Between Wikipedia and Twitter Corpora by Using Word Embeddings
    Tan, Luchen
    Zhang, Haotian
    Clarke, Charles L. A.
    Smucker, Mark D.
    PROCEEDINGS OF THE 53RD ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL) AND THE 7TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (IJCNLP), VOL 2, 2015, : 657 - 661
  • [10] Analogies Explained: Towards Understanding Word Embeddings
    Allen, Carl
    Hospedales, Timothy
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 97, 2019, 97