Amharic Text Corpus based on Parts of Speech tagging and headwords

Authors
Abebe, Tsegaye [1]
Alemneh, Esubalew [1]
Affiliations
[1] Bahir Dar Univ, Bahir Dar Inst Technol, ICT4D Res Ctr, Bahir Dar, Ethiopia
Keywords
Amharic language; semi-automatic text tagger; corpus linguistics; part of speech tag; headwords;
DOI
10.1109/ICT4DA53266.2021.9672246
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
A corpus is a key resource for studying natural languages and for developing tools that process human language. Because few studies have addressed Amharic corpus development, the existing corpora are very small and not readily accessible to academics or to commercial and non-commercial organizations. This paper presents an Amharic text corpus developed by annotating each word with its part-of-speech tag and reducing each orthographic word to its headword through either a derivational or an inflectional process. We extracted 12,720 sentences from text documents collected in the domain of proclamations, including the Ethiopian constitution of 1987 E.C. and several policies of the Amhara regional state and the federal government of Ethiopia. These sentences yielded 331,728 tokens. We compiled 66 tag sets from base and compound part-of-speech tag set classes, based on different factors and on the representation of orthographic words. To support the manual annotation of each orthographic word, we developed a semi-automatic Amharic text tagger. The outputs of the project are pre-processed Amharic text stored in plain text format and a tagged Amharic text corpus encoded in Extensible Markup Language (XML) format. The tag sets of the annotated corpus are represented in both Ge'ez script and English characters. We plan to increase the number of tag sets and the size of the corpus in the near future, and we are working toward making the semi-automatic tagger fully automatic.
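The annotation scheme described in the abstract (each orthographic word paired with a part-of-speech tag and a headword, stored as XML) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the element and attribute names (`sentence`, `w`, `pos`, `head`, `review`), the placeholder tags `N`, `NPL` and `UNK`, the tiny lexicon, and the lookup-then-review behavior are not taken from the paper, whose actual tag set and tagger design are not given in this record.

```python
import xml.etree.ElementTree as ET

# Hypothetical lexicon: orthographic word -> (POS tag, headword).
# The tags "N" and "NPL" are placeholders, not the paper's tag set.
LEXICON = {
    "ቤት": ("N", "ቤት"),
    "ቤቶች": ("NPL", "ቤት"),
    "ልጆች": ("NPL", "ልጅ"),
}


def suggest(word):
    """Return (tag, headword, needs_review) for one orthographic word."""
    if word in LEXICON:
        tag, headword = LEXICON[word]
        return tag, headword, False
    # Unknown words are left for a human annotator to resolve.
    return "UNK", word, True


def tag_sentence(text, sent_id):
    """Build a <sentence> element with one <w> child per whitespace token."""
    sent = ET.Element("sentence", id=str(sent_id))
    for word in text.split():
        tag, headword, needs_review = suggest(word)
        w = ET.SubElement(sent, "w", pos=tag, head=headword)
        if needs_review:
            w.set("review", "yes")
        w.text = word
    return sent


# Three illustrative tokens; the last one is not in the lexicon.
element = tag_sentence("ልጆች ቤቶች ሰሩ", 1)
print(ET.tostring(element, encoding="unicode"))
```

In this sketch, known words receive a suggested tag and headword automatically, while unknown words are flagged with `review="yes"`, which is one plausible reading of "semi-automatic" tagging.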
Pages: 77-82
Number of pages: 6