Extraction of Proper Names from Myanmar Text Using Latent Dirichlet Allocation

Authors
Win, Yuzana [1]
Masada, Tomonari [1]
Affiliations
[1] Nagasaki Univ, Grad Sch Engn, Nagasaki, Japan
Keywords
LDA; LSI; rule-based; K-means clustering;
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper proposes a method for extracting proper names from Myanmar text using latent Dirichlet allocation (LDA). The method aims to extract proper names that convey important information about the content of Myanmar text. It consists of two steps. In the first step, we extract topic words from Myanmar news articles using LDA. In the second step, we apply post-processing, because the resulting topic words contain some noisy words. The post-processing first eliminates the topic words whose prefixes are Myanmar digits and whose suffixes are noun and verb particles. We then remove duplicate words and discard the topic words that appear in an existing dictionary. The remaining words are candidate proper names, namely personal names, geographical names, unique object names, organization names, single event names, and so on. The evaluation is performed from both subjective and quantitative perspectives. From the subjective perspective, we compare the accuracy of the proper names extracted by our method with that of the names extracted by latent semantic indexing (LSI) and by a rule-based method. Both LSI and our method achieve higher accuracy than the rule-based method, but our method provides more interesting proper names than LSI. From the quantitative perspective, we use the extracted proper names as additional features in K-means clustering. The experimental results show that the document clusters given by our method are better than those given by LSI and the rule-based method in precision, recall, and F-score.
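The post-processing step described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `filter_topic_words`, the particle list, and the dictionary are hypothetical placeholders, and the sketch assumes a word is dropped if it either starts with a Myanmar digit (Unicode U+1040 to U+1049) or ends with one of the supplied particles. The abstract does not give the actual particle list or dictionary, so both are passed in as parameters.

```python
# Myanmar digits occupy the Unicode range U+1040..U+1049.
MYANMAR_DIGITS = {chr(cp) for cp in range(0x1040, 0x104A)}


def filter_topic_words(topic_words, particles, dictionary):
    """Return candidate proper names from a list of LDA topic words.

    topic_words: iterable of strings (e.g. top words of each LDA topic)
    particles:   iterable of suffix strings to reject (hypothetical list)
    dictionary:  set of known common words to discard (hypothetical)
    """
    seen = set()
    candidates = []
    for word in topic_words:
        if not word:
            continue
        # Assumption: reject words that begin with a Myanmar digit.
        if word[0] in MYANMAR_DIGITS:
            continue
        # Assumption: reject words that end with a noun or verb particle.
        if any(p and word.endswith(p) for p in particles):
            continue
        # Discard words already listed in the existing dictionary.
        if word in dictionary:
            continue
        # Remove duplicates while keeping first-seen order.
        if word in seen:
            continue
        seen.add(word)
        candidates.append(word)
    return candidates


# Placeholder tokens: "P" stands in for a particle, "river" for a dictionary word.
words = ["\u1041\u1042abc", "Nagasaki", "Nagasaki", "walkP", "river"]
print(filter_topic_words(words, particles=("P",), dictionary={"river"}))
# ['Nagasaki']
```

The filters are independent of one another, so their order does not affect the result; keeping first-seen order simply makes the output easier to compare against the ranked topic words.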
Pages: 96-103
Number of pages: 8