Extraction of Proper Names from Myanmar Text Using Latent Dirichlet Allocation

Authors
Win, Yuzana [1]
Masada, Tomonari [1]
Affiliations
[1] Nagasaki Univ, Grad Sch Engn, Nagasaki, Japan
Keywords
LDA; LSI; rule-based; K-means clustering
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper proposes a method for extracting proper names from Myanmar text using latent Dirichlet allocation (LDA). The method aims to extract proper names that convey important information about the content of Myanmar text, and it consists of two steps. In the first step, we extract topic words from Myanmar news articles using LDA. In the second step, we post-process the resulting topic words, because they contain some noisy words. The post-processing first eliminates the topic words whose prefixes are Myanmar digits and whose suffixes are noun and verb particles. We then remove duplicate words and discard the topic words that appear in an existing dictionary. The remaining words are candidates for proper names, namely personal names, geographical names, unique object names, organization names, single event names, and so on. The evaluation is performed from both subjective and quantitative perspectives. From the subjective perspective, we compare the accuracy of the proper names extracted by our method with that of the proper names extracted by latent semantic indexing (LSI) and by a rule-based method. Both LSI and our method improve on the accuracy of the rule-based method; however, our method provides more interesting proper names than LSI. From the quantitative perspective, we use the extracted proper names as additional features in K-means clustering. The experimental results show that the document clusters given by our method are better than those given by LSI and the rule-based method in precision, recall, and F-score.
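The post-processing step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `filter_topic_words`, the `PARTICLE_SUFFIXES` list, and the sample words are all hypothetical, and the sketch assumes a word is dropped if it either starts with a Myanmar digit or ends with a listed particle (the abstract does not say whether both conditions must hold). The topic words themselves are assumed to come from an LDA run that is not shown here.

```python
# Myanmar digits occupy U+1040..U+1049 in Unicode.
MYANMAR_DIGITS = {chr(cp) for cp in range(0x1040, 0x104A)}

# Hypothetical particle suffixes, for illustration only; the abstract
# does not list the noun and verb particles the authors actually use.
PARTICLE_SUFFIXES = ("များ", "သည်", "ကို")


def filter_topic_words(topic_words, dictionary, suffixes=PARTICLE_SUFFIXES):
    """Return proper-name candidates from a list of topic words.

    Drops words that start with a Myanmar digit, words that end with one
    of the given particle suffixes, duplicates, and dictionary words.
    The order of first occurrence is preserved.
    """
    seen = set()
    candidates = []
    for word in topic_words:
        if not word:
            continue  # skip empty strings
        if word[0] in MYANMAR_DIGITS:
            continue  # digit prefix
        if word.endswith(tuple(suffixes)):
            continue  # noun or verb particle suffix
        if word in seen or word in dictionary:
            continue  # duplicate or known dictionary word
        seen.add(word)
        candidates.append(word)
    return candidates


# Placeholder input: Latin strings stand in for segmented Myanmar words.
sample = ["\u1041\u1042abc", "Yangon", "Yangon", "house"]
print(filter_topic_words(sample, dictionary={"house"}))  # prints ['Yangon']
```

A set is used for both the dictionary and the duplicate check so that each membership test takes constant time on average, which matters when the topic-word list is large.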
Pages: 96-103
Number of pages: 8