A Categorical Data Clustering Algorithm and Its Efficient Parallel Implementation

Cited by: 0

Authors:

Ding, Xiangwu [1]

Tan, Jia [1]

Wang, Mei [1]

Affiliations:

[1] Donghua Univ, Coll Comp Sci & Technol, Shanghai, Peoples R China

Source:

PROCEEDINGS OF 2016 5TH INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND NETWORK TECHNOLOGY (ICCSNT) | 2016

Keywords:

categorical data; CLOPE; p-CLOPE; RW-CLOPE; Spark

DOI:

Not available

Chinese Library Classification:

TP301 [Theory and methods]

Discipline classification code:

081202

Abstract:

For clustering large-scale, high-dimensional, sparse categorical data, CLOPE greatly improves on traditional clustering algorithms in both clustering quality and running speed. However, CLOPE has several shortcomings: its clustering quality is unstable, it does not distinguish how much each attribute (dimension) contributes to clustering, and it requires the repulsion factor r to be specified in advance. This paper therefore proposes RW-CLOPE, a clustering algorithm for categorical data based on random sequence iteration and attribute weights. RW-CLOPE uses a "shuffle" model to reorder the raw data randomly, which eliminates the effect of the data input order on clustering quality. In addition, a method for calculating attribute weights from attribute entropy is proposed to distinguish the clustering contribution of each dimension, which greatly improves clustering quality. Finally, RW-CLOPE is implemented on the Spark cluster computing platform. Experiments on two different real-world databases show that RW-CLOPE achieves better clustering quality than p-CLOPE when the number of datasets is the same. On the mushrooms dataset, when CLOPE obtains its best results, RW-CLOPE achieves a profit value 68% larger than that of CLOPE and 25% larger than that of p-CLOPE. When processing massive data, the execution time of RW-CLOPE is much shorter than that of p-CLOPE. When sufficient computing resources are available, the more shuffled copies of the data are used, the more pronounced the improvement in execution time.
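
The abstract refers to a "profit value" and to "attribute entropy" without giving formulas. As a rough illustration only, the sketch below computes the profit criterion as defined for the original CLOPE algorithm (sum over clusters of S(C) / W(C)^r times the cluster size, divided by the total number of transactions) and the Shannon entropy of a single categorical attribute. The function names `clope_profit` and `attribute_entropy` are hypothetical, and how RW-CLOPE converts entropy into attribute weights is not stated in this record, so no weighting step is shown.

```python
import math
from collections import Counter


def clope_profit(clusters, r=2.0):
    """Profit of a clustering, following the original CLOPE criterion.

    clusters: list of clusters; each cluster is a list of transactions,
              and each transaction is a set of categorical items.
    r: the factor r mentioned in the abstract (must be chosen by the user).
    """
    numerator = 0.0
    total = 0
    for cluster in clusters:
        counts = Counter(item for t in cluster for item in t)
        width = len(counts)           # W(C): number of distinct items
        if width == 0:
            continue                  # skip clusters that contain no items
        size = sum(counts.values())   # S(C): total item occurrences
        numerator += size / (width ** r) * len(cluster)
        total += len(cluster)
    return numerator / total if total else 0.0


def attribute_entropy(column):
    """Shannon entropy (in bits) of one categorical attribute column."""
    n = len(column)
    counts = Counter(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    together = [[{"a", "b"}, {"a", "b"}]]
    apart = [[{"a", "b"}], [{"c", "d"}]]
    print(clope_profit(together), clope_profit(apart))
    print(attribute_entropy(["x", "x", "y", "y"]))
```

Grouping identical transactions yields a higher profit than separating dissimilar ones into singleton clusters, which is the behaviour the criterion is meant to reward. An attribute with a single value has zero entropy.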


Pages: 224-228

Page count: 5

