Morpheme Matching Based Text Tokenization for a Scarce Resourced Language

Cited by: 9
Authors
Rehman, Zobia [1]
Anwar, Waqas [1, 2]
Bajwa, Usama Ijaz [1]
Wang Xuan [2]
Zhou Chaoying [2]
Affiliations
[1] COMSATS Inst Informat Technol, Dept Comp Sci, Abbottabad, Pakistan
[2] Harbin Inst Technol, Grad Sch, Shenzhen, Peoples R China
Source
PLOS ONE | 2013 / Vol. 8 / No. 8
DOI
10.1371/journal.pone.0068178
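Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Discipline classification codes
07; 0710; 09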
Abstract
Text tokenization is a fundamental pre-processing step for almost all information processing applications. The task is nontrivial for scarce-resourced languages such as Urdu, because spaces are used inconsistently between words. This paper proposes a morpheme matching based approach to Urdu text tokenization, along with additional algorithms that address boundary detection for compound words, affixation, reduplication, names, and abbreviations. The study achieved 97.28% precision, 93.71% recall, and 95.46% F1-measure when tokenizing a corpus of 57,000 words using a morpheme list of 6,400 entries.
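The abstract does not spell out the matching procedure, so the following is only a minimal sketch of one common way to segment text against a morpheme list: greedy forward longest-match. It is an assumption for illustration, not the authors' algorithm. The function names `tokenize` and `f1_measure` and the toy Latin-script morpheme set are hypothetical stand-ins for an Urdu morpheme list. The second function checks that the reported F1-measure follows from the reported precision and recall.

```python
def tokenize(text, morphemes):
    """Greedy forward longest-match segmentation (illustrative assumption).

    `morphemes` must be a non-empty set of strings. Characters that match
    no morpheme are emitted as single-character tokens.
    """
    max_len = max(len(m) for m in morphemes)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in morphemes:
                break
        else:
            j = i + 1  # no morpheme starts here: fall back to one character
        tokens.append(text[i:j])
        i = j
    return tokens


def f1_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy example with space-free input; the morpheme set is hypothetical.
toy_morphemes = {"the", "cat", "sat", "a", "at"}
print(tokenize("thecatsat", toy_morphemes))

# Consistency check on the figures reported in the abstract.
print(round(f1_measure(97.28, 93.71), 2))
```

Longest-match is greedy, so it can pick a wrong boundary when a long morpheme overlaps the start of the next word; this is one reason a real system would need the extra handling for compounds, affixes, and names that the abstract mentions.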
Pages: 8