isiZulu Word Embeddings

Cited by: 0
Authors
Dlamini, Sibonelo [1 ]
Jembere, Edgar [1 ]
Pillay, Anban [1 ]
van Niekerk, Brett [1 ]
Affiliations
[1] Univ KwaZulu Natal, Dept Comp Sci, Durban, South Africa
Keywords
isiZulu; word embeddings; semantic relatedness; agglutinative language; subword embeddings;
DOI
10.1109/ICTAS50802.2021.9395011
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Word embeddings are currently the most popular vector space model in Natural Language Processing. How we encode words matters because it affects the performance of many downstream tasks, such as Machine Translation (MT), Information Retrieval (IR) and Automatic Speech Recognition (ASR). While much focus has been placed on constructing word embeddings for English, very little attention has been paid to under-resourced languages, especially native African languages. In this paper, we select four popular word embedding models (Word2Vec CBOW and Skip-Gram, FastText, and GloVe) and train them on the 10-million-token isiZulu National Corpus (INC) to create isiZulu word embeddings. To the best of our knowledge, this is the first time that word embeddings in isiZulu have been constructed and made available to the public. We create a semantic similarity data set analogous to WordSim353, which we also make publicly available. This data set is used to evaluate the four models and determine which is best for creating isiZulu word embeddings in a low-resource (small corpus) setting. We found that the Word2Vec Skip-Gram model produced the highest-quality embeddings, as measured by this semantic similarity task. However, the GloVe model performed best on the nearest-neighbours task.
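As a rough illustration of the pipeline the abstract describes, the sketch below trains a Skip-Gram Word2Vec model with gensim on a tokenised corpus and scores it against a WordSim353-style word-pair file, then runs a qualitative nearest-neighbours query. The file names (zulu_corpus.txt, zulu_wordsim.tsv), the hyperparameters, and the probe word are illustrative assumptions, not the authors' actual settings or data.

```python
# A minimal sketch, assuming a plain-text isiZulu corpus (one tokenised
# sentence per line) and a WordSim353-style TSV of (word1, word2, score)
# human-judgment pairs. Paths and hyperparameters are hypothetical.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Stream the corpus from disk, one tokenised sentence at a time.
sentences = LineSentence("zulu_corpus.txt")  # hypothetical path

# sg=1 selects Skip-Gram, the model the paper found best on semantic similarity.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, sg=1)

# Spearman correlation against human similarity judgments, analogous to
# the WordSim353-style evaluation described in the abstract.
pearson, spearman, oov = model.wv.evaluate_word_pairs("zulu_wordsim.tsv")
print(f"Spearman rho: {spearman[0]:.3f}, OOV ratio: {oov:.1f}%")

# Qualitative nearest-neighbours check for a sample word
# (umuntu = "person"; any in-vocabulary word would do).
print(model.wv.most_similar("umuntu", topn=5))
```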
Pages: 121 - 126
Page count: 6