Application of Algorithm CARDBK in Document Clustering

Authors
ZHU Yehang [1]
ZHANG Mingjie [1]
SHI Feng [2]
Affiliations
[1] College of Economics and Management, Xi'an University of Posts and Telecommunications
[2] Information Business Department, Puyang Technician College
Keywords
algorithm design and analysis; clustering; document analysis; text processing
DOI
Not available
Chinese Library Classification (CLC) number
TP301.6 [Algorithm theory]
Discipline classification code
081202
Abstract
In the K-means clustering algorithm, each data point is assigned to exactly one cluster. Clustering quality depends heavily on the initial cluster centroids: different initializations can yield different results, and local adjustment cannot rescue the result from a poor local optimum. An outlier in a cluster can seriously distort the cluster mean. K-means is also suitable only for clusters with convex shapes. We therefore propose a novel clustering algorithm, CARDBK. "CARD" stands for "centroid all rank distance", meaning that all centroids are sorted by their distance from a given point, and "BK" stands for "batch K-means". In CARDBK, a point modifies not only the cluster centroid nearest to it but also multiple cluster centroids adjacent to it, and the degree of influence of a point on a cluster centroid depends on the distances between the point and the other, nearer cluster centroids. In experiments on a number of different data sets, CARDBK outperformed other algorithms on the following performance indexes: entropy, purity, F1 value, Rand index, and normalized mutual information (NMI). The algorithm also proved more stable, linearly scalable, and faster.
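The abstract does not give the CARDBK weighting formula, so the following is only a minimal sketch of the general idea it describes: in one batch pass, each point contributes to several of its nearest centroids, and its influence on a centroid shrinks according to its distances to the nearer centroids. The weight rule in `rank_weights`, the parameter `top_m`, and both function names are assumptions made for illustration and are not taken from the paper.

```python
import math


def rank_weights(dists, top_m):
    """Map centroid index -> weight for the top_m nearest centroids.

    ASSUMED rule (not from the paper): the nearest centroid gets weight 1;
    a centroid at rank r gets the product of d_j / d_r over every nearer
    centroid j, so its weight depends on the distances to nearer centroids.
    """
    order = sorted(range(len(dists)), key=lambda k: dists[k])[:top_m]
    weights = {}
    for r, k in enumerate(order):
        w = 1.0
        for j in order[:r]:
            # If d_k is 0, every nearer distance is 0 as well; keep weight 1.
            w *= dists[j] / dists[k] if dists[k] > 0 else 1.0
        weights[k] = w
    return weights


def batch_update(points, centroids, top_m=2):
    """One batch pass: each centroid becomes a weighted mean of the points."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    totals = [0.0] * len(centroids)
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        for k, w in rank_weights(dists, top_m).items():
            totals[k] += w
            for i in range(dim):
                sums[k][i] += w * p[i]
    # A centroid that received no weight keeps its previous position.
    return [
        [s / totals[k] for s in sums[k]] if totals[k] > 0 else list(centroids[k])
        for k in range(len(centroids))
    ]
```

With `top_m=1` every point influences only its nearest centroid, so the sketch reduces to an ordinary batch K-means update; larger values of `top_m` let each point also pull on its next-nearest centroids with a reduced weight.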
Pages: 514-524
Number of pages: 11