Morpheme Matching Based Text Tokenization for a Scarce Resourced Language

Cited by: 9

Authors:

Rehman, Zobia [1]

Anwar, Waqas [1,2]

Bajwa, Usama Ijaz [1]

Wang Xuan [2]

Zhou Chaoying [2]

Affiliations:

[1] COMSATS Inst Informat Technol, Dept Comp Sci, Abbottabad, Pakistan

[2] Harbin Inst Technol, Grad Sch, Shenzhen, Peoples R China

Source:

PLOS ONE | 2013 / Volume 8 / Issue 08

Keywords:

DOI:

10.1371/journal.pone.0068178

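Chinese Library Classification (CLC):

O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]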

Discipline classification codes:

07; 0710; 09

Abstract:

Text tokenization is a fundamental pre-processing step for almost all information processing applications. The task is nontrivial for scarce-resourced languages such as Urdu, because spaces are used inconsistently between words. This paper proposes a morpheme matching based approach for Urdu text tokenization, along with additional algorithms that address boundary detection of compound words, affixation, reduplication, names, and abbreviations. The study achieved 97.28% precision, 93.71% recall, and a 95.46% F1-measure when tokenizing a corpus of 57000 words with a morpheme list of 6400 entries.
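The abstract does not spell out the matching procedure, so the following is only a minimal sketch of one common way to segment text against a morpheme list: greedy longest-match scanning. It is not the authors' algorithm. The function name `segment`, the toy morpheme set, and the fallback of emitting a single character when nothing matches are all assumptions made for illustration, and Latin-script toy data stands in for Urdu.

```python
def segment(text, morphemes):
    """Greedy longest-match segmentation of a space-free string.

    Illustrative only: `morphemes` is a hypothetical, non-empty set of
    known morphemes, not the 6400-entry list used in the paper.
    """
    max_len = max(len(m) for m in morphemes)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in morphemes:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing matched at this position: emit one character
            # as an unknown token and move on (an assumed fallback).
            tokens.append(text[i])
            i += 1
    return tokens


toy_morphemes = {"un", "break", "able", "breakable"}
print(segment("unbreakable", toy_morphemes))  # ['un', 'breakable']
```

Preferring the longest match keeps "breakable" whole instead of splitting it into "break" and "able". A real system for Urdu would still need the extra handling the abstract lists, such as compound words, reduplication, names, and abbreviations.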


Pages: 8
