Text-based Language Identification for Some of the Under-resourced Languages of South Africa

Authors
Sefara, Tshephisho Joseph [1]
Manamela, Madimetja Jonas [1]
Malatji, Promise Tshepiso [1]
Affiliations
[1] Univ Limpopo Polokwane, Dept Comp Sci, Telkom Ctr Excellence Speech Technol, Polokwane, South Africa
Keywords
support vector machines; multinomial naive Bayes; WEKA; machine learning; text classification; language identification; multiclass classification
DOI
Not available
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Language identification is the problem of correctly classifying a sample of text or a document according to its language. Much of the existing research has focused on English-language corpora, and little has addressed the other official languages of South Africa. In a multilingual society like South Africa, automatic language identification in any language-specific system would be a vital step in bridging the digital divide between diverse members of society. Various machine learning algorithms can be used to identify the natural language of a document or text. This paper presents text-based language identification using individual proper names, specifically surnames, in a South African context. Three supervised machine learning methods, based on support vector machines and naive Bayes language models, are implemented to perform 3-way multiclass classification. These algorithms are applied to the language identification task and evaluated in extensive experiments for three official languages of South Africa: Tshivenda, Xitsonga and Sepedi. All three machine learning methods achieved remarkable results in 10-fold cross-validation. The results indicate that the multinomial naive Bayes method performed better than the other algorithms.
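The multinomial naive Bayes classification and k-fold cross-validation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the keywords indicate the paper used WEKA, whereas the sketch below is plain Python with only the standard library. The character-bigram features, the `MultinomialNB` class, the `char_ngrams` and `cross_validate` helpers, and the smoothing value `alpha=1.0` are all assumptions made for illustration; the abstract does not state which features or settings were used. The strings in `NAMES` are made-up placeholders with generic labels `A`, `B`, `C`, standing in for surnames labelled Tshivenda, Xitsonga and Sepedi.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(name, n=2):
    """Return the character n-grams of a name, padded with boundary markers."""
    padded = "^" + name.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class MultinomialNB:
    """Hypothetical minimal multinomial naive Bayes with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, names, labels):
        self.class_counts = Counter(labels)          # training names per class
        self.feature_counts = defaultdict(Counter)   # n-gram counts per class
        self.vocab = set()
        for name, label in zip(names, labels):
            grams = char_ngrams(name)
            self.feature_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, name):
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(count / total)
            denom = sum(self.feature_counts[label].values()) + self.alpha * vocab_size
            for gram in char_ngrams(name):
                if gram not in self.vocab:
                    continue  # n-grams never seen in training are ignored
                score += math.log((self.feature_counts[label][gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def cross_validate(names, labels, k=10):
    """Return accuracy over k interleaved folds (item i goes to fold i % k)."""
    pairs = list(zip(names, labels))
    correct = 0
    for fold in range(k):
        train = [p for i, p in enumerate(pairs) if i % k != fold]
        test = [p for i, p in enumerate(pairs) if i % k == fold]
        model = MultinomialNB().fit(*zip(*train))
        correct += sum(model.predict(x) == y for x, y in test)
    return correct / len(pairs)


# Made-up placeholder strings, NOT real surnames and NOT data from the paper.
NAMES = ["abab", "abba", "baab", "aabb",
         "cdcd", "dccd", "cddc", "ddcc",
         "efef", "feef", "effe", "eeff"]
LABELS = ["A"] * 4 + ["B"] * 4 + ["C"] * 4

if __name__ == "__main__":
    model = MultinomialNB().fit(NAMES, LABELS)
    print(model.predict("baba"))
    print(cross_validate(NAMES, LABELS, k=4))
```

With only twelve placeholder strings the example uses `k=4`; the 10-fold setting mentioned in the abstract would need a dataset large enough for every fold to contain test items. An SVM baseline would be evaluated through the same `cross_validate` loop by swapping in a different model class.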
Pages: 303-307
Number of pages: 5