Revisiting K-Means and Topic Modeling, a Comparison Study to Cluster Arabic Documents

Cited by: 49
Authors
Alhawarat, M. [1]
Hegazi, M. [1]
Affiliations
[1] Prince Sattam Bin Abdulaziz Univ, Dept Comp Sci, Al Kharj 11942, Saudi Arabia
Source
IEEE ACCESS | 2018 / Vol. 6
Keywords
Clustering text documents; K-means; Arabic language; topic modeling; latent Dirichlet allocation (LDA); TEXTS;
DOI
10.1109/ACCESS.2018.2852648
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Clustering Arabic text documents is important for many natural language technologies. This paper clusters Arabic text documents with a combined method that pairs a generative model, latent Dirichlet allocation (LDA), with the k-means clustering algorithm, and applies both to a news data set used in previous similar studies. The aim of the paper is twofold. First, it shows that normalizing the weights in the vector space, that is, in the document-term matrix of the text documents, dramatically improves cluster quality and hence clustering accuracy when using the k-means algorithm; the results are compared to a recent study on clustering Arabic text documents. Second, it shows that the combined method is superior in clustering quality for Arabic text documents according to external measures such as purity, F-measure, entropy, and accuracy. The purity of the combined method is 0.933, compared to 0.82 for the k-means algorithm, and both figures are higher than those reported in a recent similar study. The other validation measures confirm this result. The correctness of the combined method is then confirmed using different Arabic data sets.
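The two quantities the abstract leans on, weight normalization of the document-term matrix and cluster purity, can be illustrated with a small sketch. This is not the authors' code: the abstract does not say which normalization was used, so L2 (unit-length) row scaling is assumed here, and the helper names `l2_normalize` and `purity` as well as the toy data are hypothetical.

```python
import math
from collections import Counter

def l2_normalize(rows):
    """Scale each document vector (row) to unit Euclidean length.

    Assumption: L2 normalization; the abstract only says "normalizing the weights".
    """
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(w * w for w in row))
        # Leave all-zero rows untouched to avoid dividing by zero.
        normalized.append([w / norm for w in row] if norm > 0 else list(row))
    return normalized

def purity(cluster_ids, true_labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    by_cluster = {}
    for cluster, label in zip(cluster_ids, true_labels):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(true_labels)

# Toy document-term weights (hypothetical): rows are documents, columns are terms.
# Documents 0 and 2 have the same term proportions but very different lengths.
doc_term = [[3.0, 4.0, 0.0], [0.0, 0.0, 2.0], [30.0, 40.0, 0.0]]
unit_rows = l2_normalize(doc_term)

# Hypothetical cluster assignments and gold labels for six documents.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["sport", "sport", "economy", "economy", "economy", "sport"]
score = purity(clusters, labels)
```

After normalization, documents 0 and 2 map to the same unit vector, so a Euclidean-distance method such as k-means no longer separates them merely because one is longer. The purity function matches the usual definition of the external measure quoted in the abstract (0.933 versus 0.82), but the toy labels above are unrelated to the paper's data.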
Pages: 42740-42749
Page count: 10