Morpheme Matching Based Text Tokenization for a Scarce Resourced Language

Cited by: 9
Authors
Rehman, Zobia [1]
Anwar, Waqas [1, 2]
Bajwa, Usama Ijaz [1]
Wang Xuan [2]
Zhou Chaoying [2]
Affiliations
[1] COMSATS Inst Informat Technol, Dept Comp Sci, Abbottabad, Pakistan
[2] Harbin Inst Technol, Grad Sch, Shenzhen, Peoples R China
Source
PLOS ONE | 2013 / Vol. 8 / Issue 8
DOI
10.1371/journal.pone.0068178
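Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09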
Abstract
Text tokenization is a fundamental pre-processing step for almost all information processing applications. The task is nontrivial for scarce-resourced languages such as Urdu, because spaces are used inconsistently between words. This paper proposes a morpheme matching based approach to Urdu text tokenization, together with additional algorithms that handle boundary detection for compound words, affixation, reduplication, names, and abbreviations. The approach achieved 97.28% precision, 93.71% recall, and 95.46% F1-measure when tokenizing a corpus of 57,000 words using a morpheme list of 6,400 entries.
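The abstract does not spell out the matching procedure, so the following is only a minimal sketch of one common way to segment text against a morpheme list: greedy longest matching. The function names `longest_match_tokenize` and `f1_score`, and the toy English morpheme set, are assumptions made for illustration and are not taken from the paper. The second function simply checks that the reported F1-measure is the harmonic mean of the reported precision and recall.

```python
from typing import List, Set


def longest_match_tokenize(text: str, morphemes: Set[str]) -> List[str]:
    """Greedy longest-match segmentation of unspaced text.

    Illustrative only: this is a generic dictionary-matching sketch,
    not the algorithm described in the paper.
    """
    max_len = max((len(m) for m in morphemes), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single unknown character is emitted as its own token,
            # which guarantees the scan always advances.
            if piece in morphemes or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy morpheme list (English, for readability).
toy_morphemes = {"un", "break", "able", "breakable"}
print(longest_match_tokenize("unbreakable", toy_morphemes))  # ['un', 'breakable']

# Precision and recall as reported in the abstract.
print(round(f1_score(97.28, 93.71), 2))  # 95.46
```

Greedy matching prefers "breakable" over "break" + "able" because the longer entry is tried first; a real system for Urdu would need the additional handling for compounds, affixes, reduplication, names, and abbreviations that the abstract mentions.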
Pages: 8
Related papers
50 in total
  • [1] Stochastic Tokenization with a Language Model for Neural Text Classification
    Hiraoka, Tatsuya
    Shindo, Hiroyuki
    Matsumoto, Yuji
    [J]. 57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019: 1620 - 1629
  • [2] Language-Independent Text Tokenization Using Unsupervised Deep Learning
    Mahmoud, Hanan A. Hosni
    Hafez, Alaaeldin M.
    Alabdulkreem, Eatedal
    [J]. INTELLIGENT AUTOMATION AND SOFT COMPUTING, 2023, 35 (01): 321 - 334
  • [3] Character N-Gram Tokenization for European Language Text Retrieval
    Paul McNamee
    James Mayfield
    [J]. Information Retrieval, 2004, 7 : 73 - 97
  • [4] Character N-gram tokenization for European language text retrieval
    McNamee, P
    Mayfield, J
    [J]. INFORMATION RETRIEVAL, 2004, 7 (1-2): 73 - 97
  • [5] Tokenization-based data augmentation for text classification
    Prakrankamanant, Patawee
    Chuangsuwanich, Ekapol
    [J]. 2022 19TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER SCIENCE AND SOFTWARE ENGINEERING (JCSSE 2022), 2022
  • [6] Text-based Language Identification for Some of the Under-resourced Languages of South Africa
    Sefara, Tshephisho Joseph
    Manamela, Madimetja Jonas
    Malatji, Promise Tshepiso
    [J]. 2016 THIRD INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING, COMMUNICATION AND ENGINEERING (ICACCE 2016), 2016: 303 - 307
  • [7] Morpheme-based Korean text cohesion analyzer
    Kim, Dong-Hyun
    Ahn, Seokho
    Lee, Euijong
    Seo, Young-Duk
    [J]. SOFTWAREX, 2024, 26
  • [8] Morpheme based language models for speech recognition of Czech
    Byrne, W
    Hajic, J
    Ircing, P
    Krbec, P
    Psutka, J
    [J]. TEXT, SPEECH AND DIALOGUE, PROCEEDINGS, 2000, 1902 : 211 - 216
  • [9] Sentence Tokenization Using Statistical Unsupervised Machine Learning and Rule-Based Approach for Running Text in Gujarati Language
    Tailor, Chetana
    Patel, Bankim
    [J]. EMERGING TRENDS IN EXPERT APPLICATIONS AND SECURITY, 2019, 841 : 319 - 326
  • [10] Morpheme-based language modeling for Arabic LVCSR
    Choueiter, Ghinwa
    Povey, Daniel
    Chen, Stanley F.
    Zweig, Geoffrey
    [J]. 2006 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-13, 2006: 1053 - 1056