isiZulu Word Embeddings

Cited by: 0
Authors
Dlamini, Sibonelo [1]
Jembere, Edgar [1]
Pillay, Anban [1]
van Niekerk, Brett [1]
Affiliation
[1] University of KwaZulu-Natal, Department of Computer Science, Durban, South Africa
Keywords
isiZulu; word embeddings; semantic relatedness; agglutinative language; subword embeddings
DOI
10.1109/ICTAS50802.2021.9395011
CLC Number: TP [Automation Technology, Computer Technology]
Subject Classification Code: 0812
Abstract
Word embeddings are currently the most popular vector space model in Natural Language Processing. How we encode words matters because it affects the performance of many downstream tasks such as Machine Translation (MT), Information Retrieval (IR) and Automatic Speech Recognition (ASR). While much focus has been placed on constructing word embeddings for English, very little attention has been paid to under-resourced languages, especially native African languages. In this paper we select four popular word embedding models (Word2Vec CBOW and Skip-Gram, FastText, and GloVe) and train them on the 10-million-token isiZulu National Corpus (INC) to create isiZulu word embeddings. To the best of our knowledge, this is the first time that word embeddings for isiZulu have been constructed and made available to the public. We also create, and make publicly available, a semantic similarity data set analogous to WordSim353. This data set is used to evaluate the four models and determine which is best suited to creating isiZulu word embeddings in a low-resource (small-corpus) setting. We found that the Word2Vec Skip-Gram model produced the highest-quality embeddings, as measured by the semantic similarity task, while the GloVe model performed best on the nearest-neighbour task.
Pages: 121-126 (6 pages)
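
The abstract outlines a straightforward pipeline: train embedding models on a tokenised corpus, score them against a WordSim353-style word-pair file, and inspect nearest neighbours. Below is a minimal sketch of that pipeline using gensim. The file names (inc.txt, zulu_wordsim.tsv), the query word, and all hyperparameters are illustrative assumptions, not values from the paper; GloVe is omitted because gensim loads pre-trained GloVe vectors rather than training them.

```python
# A minimal sketch of the pipeline described in the abstract, using gensim.
# Assumptions (not from the paper): file names inc.txt and zulu_wordsim.tsv,
# the query word "umuntu", and all hyperparameter values.
from gensim.models import Word2Vec, FastText
from gensim.models.word2vec import LineSentence

corpus = LineSentence("inc.txt")  # one pre-tokenised sentence per line

# Word2Vec Skip-Gram (sg=1); CBOW would use sg=0.
sg_model = Word2Vec(corpus, vector_size=100, window=5, sg=1,
                    min_count=5, epochs=10)

# FastText adds character n-gram (subword) vectors, the property that makes
# it a natural candidate for an agglutinative language like isiZulu.
ft_model = FastText(corpus, vector_size=100, window=5, sg=1,
                    min_count=5, epochs=10)

# Semantic similarity evaluation against a WordSim353-style file of
# tab-separated "word1<TAB>word2<TAB>human_score" lines. gensim reports
# Pearson and Spearman correlations plus the out-of-vocabulary ratio.
pearson, spearman, oov = sg_model.wv.evaluate_word_pairs("zulu_wordsim.tsv")
print(f"Spearman rho = {spearman[0]:.3f}, OOV = {oov:.1f}%")

# Nearest-neighbour inspection for a query word.
for word, score in sg_model.wv.most_similar("umuntu", topn=5):
    print(word, round(score, 3))
```

GloVe is trained from global co-occurrence counts rather than a streaming predictive objective, so it is typically built with the Stanford GloVe toolkit and then loaded into gensim as KeyedVectors for the same evaluations.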