Revisiting K-Means and Topic Modeling, a Comparison Study to Cluster Arabic Documents

Cited by: 49
Authors
Alhawarat, M. [1]
Hegazi, M. [1]
Affiliations
[1] Prince Sattam Bin Abdulaziz Univ, Dept Comp Sci, Al Kharj 11942, Saudi Arabia
Source
IEEE Access | 2018 / Vol. 6
Keywords
Clustering text documents; K-means; Arabic language; topic modeling; latent Dirichlet allocation (LDA); TEXTS;
DOI
10.1109/ACCESS.2018.2852648
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Clustering Arabic text documents is important for many natural language technologies. This paper uses a combined method to cluster Arabic text documents, drawing on both generative models and clustering techniques. The study applies latent Dirichlet allocation (LDA) and the k-means clustering algorithm to a news data set used in previous similar studies. The aim of the paper is twofold. First, it shows that normalizing the weights in the vector space of the document-term matrix dramatically improves cluster quality, and hence clustering accuracy, when using the k-means algorithm; these results are compared to a recent study on clustering Arabic text documents. Second, it shows that the combined method is superior in clustering quality for Arabic text documents according to external measures such as purity, F-measure, entropy, and accuracy. The purity of the combined method is 0.933, compared to 0.82 for the k-means algorithm, and both figures are higher than those of a recent similar study. The other validation measures confirm this result. The correctness of the combined method is then confirmed using different Arabic data sets.
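The two technical ideas in the abstract, weight normalization and the purity measure, can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say which normalization was used, so unit-length (L2) scaling of each document row is an assumption here, and the function names `l2_normalize` and `purity` and the toy labels are hypothetical. Purity follows its usual definition, the fraction of documents that carry the majority class of their assigned cluster.

```python
from collections import Counter
from math import sqrt


def l2_normalize(row):
    """Scale one document's term-weight vector to unit Euclidean length.

    Assumption: L2 scaling stands in for the normalization mentioned in
    the abstract, which does not name a specific scheme.
    """
    norm = sqrt(sum(w * w for w in row))
    # An all-zero row (empty document) is returned unchanged.
    return [w / norm for w in row] if norm > 0 else list(row)


def purity(cluster_ids, true_labels):
    """Fraction of documents that share the majority label of their cluster."""
    counts_by_cluster = {}
    for cluster, label in zip(cluster_ids, true_labels):
        counts_by_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(c.values()) for c in counts_by_cluster.values())
    return majority_total / len(true_labels)


# Toy data (hypothetical): six documents, two clusters, two news categories.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_labels = ["sport", "sport", "economy", "economy", "economy", "sport"]

print(l2_normalize([3.0, 4.0]))
print(purity(toy_clusters, toy_labels))
```

In the toy data each cluster holds two documents of its majority category and one of the other, so four of six documents count toward purity. A purity of 1.0 would mean every cluster contains a single category.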
Pages: 42740-42749
Page count: 10