Word-Based Bantu Language Identification using Naive Bayes

Authors
Okgetheng, Boago [1]
Budu, Emmanuella A. W. [1]
Affiliation
[1] Univ Botswana, Dept Comp Sci, Gaborone, Botswana
Source
Keywords
Language Identification; NLP; Naive Bayes; Setswana; Sesotho;
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
Language identification of text has become increasingly important as large quantities of text are processed or filtered automatically. It is a preprocessing step in Natural Language Processing (NLP) tasks such as information retrieval and machine translation. Few studies have addressed automatic language identification for Bantu languages. The task is challenging for these languages because data are scarce and because some languages, such as Setswana and Sesotho, are written very similarly. In this paper, we present a word-based Naive Bayes classifier that identifies whether a word is Sesotho or Setswana. The classifier was trained in a supervised manner on words from both languages. The training words also include adjectives, pronouns, adverbs and enumeratives. The classifier achieves an accuracy of 71.4%, which shows that the two languages can be told apart. When we doubled the number of training words, accuracy increased to 78%. We also report that the classifier fails on homographs. Performance could be improved further with more data. Additionally, syllable-level and sentence-level identification could be implemented alongside the word-based classifier.
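The kind of classifier the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state which features were used, so the sketch assumes character bigrams of each word with add-one (Laplace) smoothing. The class name WordNaiveBayes, the helper bigrams, and the eight training words are our own illustrative choices and are not the paper's data.

```python
import math
from collections import Counter


def bigrams(word):
    """Character bigrams with boundary markers (an assumed feature choice)."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


class WordNaiveBayes:
    """Multinomial Naive Bayes over character bigrams (hypothetical sketch)."""

    def fit(self, words, labels):
        self.labels = sorted(set(labels))
        self.word_counts = Counter(labels)  # words per language, for the prior
        self.feat_counts = {lab: Counter() for lab in self.labels}
        for word, lab in zip(words, labels):
            self.feat_counts[lab].update(bigrams(word))
        # Vocabulary: every bigram seen in any language during training.
        self.vocab = set().union(*self.feat_counts.values())
        return self

    def log_scores(self, word):
        total_words = sum(self.word_counts.values())
        scores = {}
        for lab in self.labels:
            counts = self.feat_counts[lab]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.word_counts[lab] / total_words)
            for bg in bigrams(word):
                if bg in self.vocab:  # skip bigrams never seen in training
                    score += math.log((counts[bg] + 1) / denom)
            scores[lab] = score
        return scores

    def predict(self, word):
        scores = self.log_scores(word)
        return max(scores, key=scores.get)


# Toy training set chosen for illustration only (not the paper's data).
setswana = ["dumela", "ngwana", "leboga", "dijo"]
sesotho = ["lumela", "ngoana", "leboha", "lijo"]
clf = WordNaiveBayes().fit(
    setswana + sesotho,
    ["setswana"] * len(setswana) + ["sesotho"] * len(sesotho),
)

print(clf.predict("dikgomo"))   # unseen word, scored from its bigrams
print(clf.predict("likhomo"))
print(clf.log_scores("metsi"))  # a word spelled the same in both languages
```

With this toy data, a word spelled identically in both languages, such as "metsi", receives the same score for each label, so the prediction is decided only by tie-breaking. That is consistent with the abstract's remark that the classifier fails on homographs, although the paper's own failure analysis may differ.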
Pages: 7