Towards Burmese (Myanmar) Morphological Analysis: Syllable-based Tokenization and Part-of-speech Tagging

Cited by: 16
Authors
Ding, Chenchen [1]
Aye, Hnin Thu Zar [2]
Pa, Win Pa [2]
Nwet, Khin Thandar [2, 3]
Soe, Khin Mar [2]
Utiyama, Masao [1]
Sumita, Eiichiro [1]
Affiliations
[1] Natl Inst Informat & Commun Technol, ASTREC, 3-5 Hikaridai,Seika Cho, Kyoto 6190289, Japan
[2] Univ Comp Studies, 4 Main Rd, Yangon, Myanmar
[3] Univ Informat Technol, Mandalay, Myanmar
Keywords
Burmese (Myanmar); annotated corpus; tokenization; POS-tagging; morphological analysis; CRF; LSTM-based RNN;
DOI
10.1145/3325885
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This article presents a comprehensive study of two primary tasks in Burmese (Myanmar) morphological analysis: tokenization and part-of-speech (POS) tagging. Twenty thousand Burmese newswire sentences are annotated with two-layer tokenization and POS-tagging information as one component of the Asian Language Treebank Project. The annotated corpus has been released under a CC BY-NC-SA license, and it was the largest open-access database of annotated Burmese at the time this manuscript was prepared in 2017. The first half of the article describes in detail the preparation, refinement, and features of the annotated corpus. The second half presents experiment-based investigations facilitated by the corpus, in which the standard sequence-labeling approach of conditional random fields (CRF) and a long short-term memory (LSTM)-based recurrent neural network (RNN) are applied and discussed. We obtain several general conclusions, covering the effect of joint tokenization and POS tagging and the importance of ensembling for stabilizing the performance of the LSTM-based RNN. This study provides a solid basis for further studies on Burmese processing.
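As a rough illustration of what the abstract means by joint tokenization and POS tagging over syllables, and by ensembling, the sketch below folds token boundaries and POS tags into a single label per syllable and combines several label sequences by per-position majority vote. This is a generic sequence-labeling formulation, not necessarily the annotation scheme or ensemble method used in the article. The `B-`/`I-` prefixes, the tag names `n` and `v`, the function names, and the placeholder syllables are all assumptions made for illustration.

```python
from collections import Counter


def encode(tokens):
    """Flatten (syllables, pos) tokens into syllables and joint labels.

    Each label carries both the token boundary (B = first syllable of a
    token, I = continuation) and the POS tag, so one tagger can predict
    tokenization and POS at once. The B-/I- scheme is an assumption.
    """
    syllables, labels = [], []
    for sylls, pos in tokens:
        for i, syllable in enumerate(sylls):
            syllables.append(syllable)
            labels.append(("B-" if i == 0 else "I-") + pos)
    return syllables, labels


def decode(syllables, labels):
    """Rebuild (syllables, pos) tokens from joint labels.

    A token takes the POS of its B- label; the POS part of a following
    I- label is ignored, which tolerates inconsistent predictions.
    """
    tokens = []
    for syllable, label in zip(syllables, labels):
        prefix, pos = label.split("-", 1)
        if prefix == "B" or not tokens:
            tokens.append(([syllable], pos))
        else:
            tokens[-1][0].append(syllable)
    return tokens


def majority_vote(predictions):
    """Combine equal-length label sequences by per-position majority vote."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*predictions)]


# Placeholder syllables and tags (hypothetical, not real corpus data).
example = [(["s1", "s2"], "n"), (["s3"], "v")]
sylls, joint_labels = encode(example)
restored = decode(sylls, joint_labels)
```

With this encoding, any syllable-level sequence labeler (for example a CRF or an LSTM-based RNN) can be trained on the joint labels, and `majority_vote` can merge the outputs of several independently trained models before decoding.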
Pages: 34