isiZulu Word Embeddings

Cited by: 0
Authors
Dlamini, Sibonelo [1]
Jembere, Edgar [1]
Pillay, Anban [1]
van Niekerk, Brett [1]
Affiliations
[1] Univ KwaZulu Natal, Dept Comp Sci, Durban, South Africa
Keywords
isiZulu; word embeddings; semantic relatedness; agglutinative language; subword embeddings;
DOI
10.1109/ICTAS50802.2021.9395011
CLC number
TP (Automation & Computer Technology)
Discipline code
0812
Abstract
Word embeddings are currently the most popular vector space model in Natural Language Processing. How we encode words is important because it affects the performance of many downstream tasks, such as Machine Translation (MT), Information Retrieval (IR) and Automatic Speech Recognition (ASR). While much focus has been placed on constructing word embeddings for English, very little attention has been paid to under-resourced languages, especially native African languages. In this paper we select four popular word embedding models (Word2Vec CBOW and Skip-Gram, FastText and GloVe) and train them on the 10 million token isiZulu National Corpus (INC) to create isiZulu word embeddings. To the best of our knowledge, this is the first time that word embeddings in isiZulu have been constructed and made available to the public. We create a semantic similarity data set analogous to WordSim353, which we also make publicly available. This data set is used to evaluate the four models and determine which is best suited to creating isiZulu word embeddings in a low-resource (small corpus) setting. We found that the Word2Vec Skip-Gram model produced the highest quality embeddings, as measured by this semantic similarity task. However, it was the GloVe model which performed best on the nearest neighbours task.
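The WordSim353-style evaluation the abstract describes is typically scored as the Spearman rank correlation between human similarity ratings and the cosine similarities the model assigns to the same word pairs. The following sketch illustrates that protocol only; the embeddings are random toy vectors and the isiZulu word pairs and ratings are illustrative placeholders, not entries from the paper's released data set.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def spearman(xs: np.ndarray, ys: np.ndarray) -> float:
    """Spearman rank correlation (no tie handling; a toy sketch).

    Ranks both series, then computes the Pearson correlation of the ranks.
    """
    rx = np.argsort(np.argsort(xs)).astype(float)
    ry = np.argsort(np.argsort(ys)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# Toy stand-in for trained embeddings: random 50-d vectors per word.
rng = np.random.default_rng(0)
words = ["umfana", "indoda", "umuntu", "inja", "imoto"]
vocab = {w: rng.normal(size=50) for w in words}

# Hypothetical (word1, word2, human rating) triples, analogous in shape
# to WordSim353 entries; the ratings here are invented for illustration.
pairs = [
    ("umfana", "indoda", 8.0),
    ("umuntu", "indoda", 7.5),
    ("inja", "umfana", 3.0),
    ("inja", "imoto", 1.5),
]

model_scores = np.array([cosine(vocab[a], vocab[b]) for a, b, _ in pairs])
human_scores = np.array([g for _, _, g in pairs])
rho = spearman(model_scores, human_scores)
```

A higher `rho` means the model's similarity ordering agrees more closely with the human judgements; this is the score on which the paper's Skip-Gram model came out ahead.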
Pages: 121-126 (6 pages)