A Categorical Data Clustering Algorithm and Its Efficient Parallel Implementation

Cited by: 0
Authors
Ding, Xiangwu [1]
Tan, Jia [1]
Wang, Mei [1]
Affiliations
[1] Donghua Univ, Coll Comp Sci & Technol, Shanghai, Peoples R China
Keywords
categorical data; CLOPE; p-CLOPE; RW-CLOPE; Spark
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
For clustering large-scale, high-dimensional, sparse categorical data, CLOPE offers a substantial improvement over traditional clustering algorithms in both clustering quality and running speed. However, CLOPE has several defects: the stability of its clustering quality is limited, it does not distinguish the contributions of different attributes (dimensions) to the clustering, and it requires the rejection factor r to be specified in advance. This paper therefore proposes RW-CLOPE, a clustering algorithm for categorical data based on random sequence iteration and attribute weights. RW-CLOPE uses a "shuffle" model to reorder the raw data randomly, eliminating the effect of the input order on clustering quality. In addition, a method for calculating attribute weights based on attribute entropy is proposed to distinguish the clustering contribution of each dimension, which greatly improves clustering quality. Finally, RW-CLOPE is implemented on Spark, an efficient cluster computing platform. Experiments on two different real datasets show that RW-CLOPE achieves better clustering quality than p-CLOPE when the number of datasets is the same. On the mushrooms dataset, when CLOPE obtains its best results, RW-CLOPE achieves a profit value 68% larger than that of CLOPE and 25% larger than that of p-CLOPE. The execution time of RW-CLOPE is much shorter than that of p-CLOPE when dealing with massive data. When sufficient computing resources are available, the more shuffled copies of the data are used, the more pronounced the improvement in execution time.
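The ideas named in the abstract can be illustrated with a small sketch. It is not the authors' RW-CLOPE implementation. It assumes the standard CLOPE profit function, sum(S/W^r * N) / sum(N), where S is the total item count of a cluster, W its number of distinct items, and N its number of transactions, and it uses a single allocation pass rather than iterating to convergence. The function names (`profit`, `clope_one_pass`, `best_of_shuffles`, `attribute_entropy`) are hypothetical. The abstract does not say how attribute entropy is turned into weights, so only the entropy itself is computed here.

```python
import math
import random
from collections import Counter


def profit(clusters, r):
    """CLOPE profit: sum(S / W**r * N) / sum(N) over non-empty clusters."""
    total_n = sum(c["n"] for c in clusters)
    if total_n == 0:
        return 0.0
    gain = sum(c["s"] / (len(c["occ"]) ** r) * c["n"] for c in clusters if c["n"])
    return gain / total_n


def delta_add(cluster, transaction, r):
    """Change in the profit numerator if `transaction` joins `cluster`."""
    s_new = cluster["s"] + len(transaction)
    w_new = len(cluster["occ"]) + sum(1 for i in transaction if i not in cluster["occ"])
    new = s_new * (cluster["n"] + 1) / (w_new ** r)
    old = cluster["s"] * cluster["n"] / (len(cluster["occ"]) ** r) if cluster["n"] else 0.0
    return new - old


def clope_one_pass(transactions, r):
    """Assign each non-empty transaction (a set of items) to the cluster,
    existing or new, that increases the profit numerator the most."""
    clusters = []
    for t in transactions:
        empty = {"n": 0, "s": 0, "occ": Counter()}
        best = max(clusters + [empty], key=lambda c: delta_add(c, t, r))
        if best is empty:
            clusters.append(best)
        best["n"] += 1
        best["s"] += len(t)
        best["occ"].update(t)
    return clusters


def best_of_shuffles(transactions, r, copies, seed=0):
    """Cluster several randomly shuffled copies and keep the most profitable.
    Each copy is independent, so the copies could be processed in parallel."""
    rng = random.Random(seed)
    best_clusters, best_profit = None, float("-inf")
    for _ in range(copies):
        order = list(transactions)
        rng.shuffle(order)
        clusters = clope_one_pass(order, r)
        p = profit(clusters, r)
        if p > best_profit:
            best_clusters, best_profit = clusters, p
    return best_clusters, best_profit


def attribute_entropy(records, j):
    """Shannon entropy (in bits) of the values of attribute j."""
    counts = Counter(rec[j] for rec in records)
    n = len(records)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    records = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
    # Encode each record as a set of (attribute index, value) items.
    transactions = [frozenset(enumerate(rec)) for rec in records]
    clusters, p = best_of_shuffles(transactions, r=2.0, copies=4)
    print(len(clusters), p, attribute_entropy(records, 0))
```

Because every shuffled copy is clustered independently, the loop in `best_of_shuffles` is the natural unit to distribute across workers, which matches the abstract's remark that more shuffled copies pay off when computing resources are sufficient.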
Pages: 224-228
Page count: 5