Development and Evaluation of Word Embeddings for Morphologically Rich Languages

Cited by: 0
Authors
Vasic, Daniel [1]
Brajkovic, Emil [1]
Affiliations
[1] Univ Mostar, Fac Sci & Educ, Mostar 88000, Bosnia & Herceg
Keywords
Natural language processing; Neural network models; Morphologically rich languages; Word embeddings; Intelligent tutoring systems;
DOI
Not available
Chinese Library Classification (CLC)
TP31 [Computer software];
Discipline classification codes
081202 ; 0835 ;
Abstract
Recent advances in natural language processing (NLP) have improved many systems that rely on natural language to communicate with their users. One of the main problems in applying NLP across multiple languages is the lack of tools for developing such systems. Croatian is a highly inflected language from the Slavic family, and traditional models that give strong results for English perform poorly on morphologically rich languages. In this article we present a model for creating word embeddings for morphologically rich languages such as Croatian. We evaluate the generated word embeddings on a newly created word similarity corpus that is based on an English similarity corpus. In the evaluation we compare our embeddings with two of the best word representation models for English, and we also compare our approach with multilingual models such as FastText. The word embeddings created in this article will be used to develop a component for training neural models for semantic understanding of sentences written in Croatian. These language tools can be used in many systems that require natural language understanding (NLU) and natural language generation (NLG). In the introduction we give an overview of word embeddings, the models used to create such representations, and where these representations can be applied. In the second section we describe some of the best models for creating word embeddings. In the third section we present a framework for the development and evaluation of word embeddings for Croatian. In the conclusion we emphasize the importance of developing tools for the Croatian language and outline future research.
Pages: 327-331
Page count: 5
Related papers
  • [1] Grapheme-level Awareness in Word Embeddings for Morphologically Rich Languages
    Park, Suzi
    Shin, Hyopil
    [J]. PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018, : 2974 - 2980
  • [2] Improving Named Entity Recognition for Morphologically Rich Languages using Word Embeddings
    Demir, Hakan
    Ozgur, Arzucan
    [J]. 2014 13TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA), 2014, : 117 - 122
  • [3] Word Semantic Similarity for Morphologically Rich Languages
    Zervanou, Kalliopi
    Iosif, Elias
    Potamianos, Alexandros
    [J]. LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2014, : 1642 - 1648
  • [4] Recovering Word Forms by Context for Morphologically Rich Languages
    Alekseev A.M.
    Nikolenko S.I.
    [J]. Journal of Mathematical Sciences, 2023, 273 (4) : 527 - 532
  • [6] Improving Phrase Chunking by using Contextualized Word Embeddings for a Morphologically Rich Language
    Ehsan, Toqeer
    Khalid, Javairia
    Ambreen, Saadia
    Mustafa, Asad
    Hussain, Sarmad
    [J]. ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, 2022, 47 (08) : 9781 - 9799
  • [7] Morphologically-Aware Vocabulary Reduction of Word Embeddings
    Chia, Chong Cher
    Tkachenko, Maksim
    Lauw, Hady W.
    [J]. 2022 IEEE/WIC/ACM INTERNATIONAL JOINT CONFERENCE ON WEB INTELLIGENCE AND INTELLIGENT AGENT TECHNOLOGY, WI-IAT, 2022, : 56 - 63
  • [8] Constituency Parse Reranking for Morphologically Rich Languages
    Szanto, Zsolt
    Farkas, Richard
    [J]. ACTA POLYTECHNICA HUNGARICA, 2015, 12 (08) : 81 - 94
  • [9] Neural Machine Translation for Morphologically Rich Languages with Improved Sub-word Units and Synthetic Data
    Pinnis, Marcis
    Krislauks, Rihards
    Deksne, Daiga
    Miks, Toms
    [J]. TEXT, SPEECH, AND DIALOGUE, TSD 2017, 2017, 10415 : 237 - 245
  • [10] Hybrid sub-word segmentation for handling long tail in morphologically rich low resource languages
    Manghat, Sreeja
    Manghat, Sreeram
    Schultz, Tanja
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6122 - 6126