Application of Algorithm CARDBK in Document Clustering

Cited by: 0

Authors:

ZHU Yehang [1]

ZHANG Mingjie [1]

SHI Feng [2]

Affiliations:

[1] College of Economics and Management, Xi'an University of Posts and Telecommunications

[2] Information Business Department, Puyang Technician College

Source:

Wuhan University Journal of Natural Sciences | 2018, Vol. 23, No. 6

Keywords:

algorithm design and analysis; clustering; document analysis; text processing

DOI:

Not available

Chinese Library Classification number:

TP301.6 [Algorithm theory]

Discipline classification code:

081202

Abstract:

In the K-means clustering algorithm, each data point is assigned to exactly one cluster. Clustering quality depends heavily on the initial cluster centroids: different initializations can yield different results, and local adjustment cannot rescue the clustering from a poor local optimum. An anomalous point in a cluster can seriously distort the cluster mean. K-means is also suitable only for clusters with convex shapes. We therefore propose a novel clustering algorithm, CARDBK, where "centroid all rank distance" (CARD) means that all centroids are sorted by their distance from a point, and "BK" stands for "batch K-means". In CARDBK, a point modifies not only the cluster centroid nearest to it but also multiple cluster centroids adjacent to it, and the degree of influence of a point on a cluster centroid depends on the distance between this point and the other, nearer cluster centroids. Experimental results show that CARDBK outperformed other algorithms on a number of different data sets on the following performance indexes: entropy, purity, F1 value, Rand index, and normalized mutual information (NMI). Our algorithm also proved to be more stable, linearly scalable, and faster.
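
The idea of letting one point update several nearby centroids can be illustrated with a minimal sketch. This is not the CARDBK algorithm itself: the abstract does not give the paper's influence formula, so the weighting rule below (a point's weight on its r-th nearest centroid is `decay ** r`) is an assumption made purely for illustration. The names `soft_rank_kmeans`, `n_neighbors`, and `decay` are likewise hypothetical.

```python
import math
import random


def soft_rank_kmeans(points, k, n_neighbors=2, decay=0.5, n_iter=20, seed=0):
    """Batch K-means variant where each point pulls on several centroids.

    ASSUMPTION: the weight of a point on its r-th nearest centroid is
    decay ** r. This rule is a placeholder; it is not taken from the paper.
    """
    rng = random.Random(seed)
    # Initialize centroids from k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    dim = len(points[0])

    for _ in range(n_iter):
        sums = [[0.0] * dim for _ in range(k)]
        weights = [0.0] * k
        for p in points:
            # Rank all centroids by their distance from this point.
            ranked = sorted(range(k), key=lambda j: math.dist(p, centroids[j]))
            # The point contributes to its n_neighbors nearest centroids,
            # with a weight that shrinks as the rank grows.
            for rank, j in enumerate(ranked[:n_neighbors]):
                w = decay ** rank
                weights[j] += w
                for d in range(dim):
                    sums[j][d] += w * p[d]
        # Batch update: move every centroid to its weighted mean at once.
        for j in range(k):
            if weights[j] > 0:
                centroids[j] = [s / weights[j] for s in sums[j]]

    # Final hard assignment: each point goes to its nearest centroid.
    labels = [
        min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
    ]
    return centroids, labels


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centers, assignment = soft_rank_kmeans(data, k=2)
    print(centers)
    print(assignment)
```

Because every point also contributes to centroids other than its nearest one, the centroids in this sketch are drawn toward each other compared with plain K-means. How strongly depends on the assumed `decay` and `n_neighbors` values.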

Citation:

Pages: 514-524

Page count: 11
