Parallel Hierarchical Subspace Clustering of Categorical Data

Cited by: 12
Authors
Pang, Ning [1]
Zhang, Jifu [1]
Zhang, Chaowei [2]
Qin, Xiao [2]
Affiliations
[1] Taiyuan Univ Sci & Technol TYUST, Taiyuan 030024, Shanxi, Peoples R China
[2] Auburn Univ, Samuel Ginn Coll Engn, Dept Comp Sci & Software Engn, Auburn, AL 36849 USA
Funding
US National Science Foundation; National Natural Science Foundation of China
Keywords
Hierarchical subspace-clustering; LSH-based data partitioning; categorical data; Hadoop; MapReduce
DOI
10.1109/TC.2018.2879332
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Parallel clustering is an important research area in big data analysis. Conventional Hierarchical Agglomerative Clustering (HAC) techniques are inadequate for large-scale categorical datasets because of two drawbacks. First, HAC consumes excessive CPU time and memory; second, it is non-trivial to decompose clustering tasks into independent sub-tasks that can be executed in parallel. We address these two problems with a MapReduce-based hierarchical subspace-clustering algorithm, called PAPU, that uses LSH-based data partitioning. PAPU partitions a large-scale dataset into multiple independent sub-datasets, into which similar data objects are mapped. In the local clustering phase, PAPU processes these independent chunks in parallel and obtains sub-clusters corresponding to their respective attribute subspaces. To improve the accuracy of the approximate clustering results, PAPU measures clusters of various scales by applying a hierarchical clustering scheme that iteratively merges sub-clusters during the global clustering phase. We implement PAPU on a 24-node Hadoop computing platform. The experimental results show that hierarchical subspace clustering coupled with the data-partitioning strategy achieves high clustering efficiency on both synthetic and real-world large-scale datasets. The experiments also demonstrate that PAPU delivers superior extensibility and scalability (e.g., a nearly linear speedup).
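The LSH-based partitioning idea in the abstract can be illustrated with a minimal sketch. The abstract does not say which LSH family PAPU uses, so the MinHash scheme below is an assumption, and the names `minhash_signature` and `lsh_partition` are hypothetical and not taken from the paper. Each categorical record is treated as a set of (attribute, value) tokens, and records with the same signature fall into the same bucket, which could then serve as an independent chunk for local clustering.

```python
import hashlib
from collections import defaultdict


def _stable_hash(token: str, seed: int) -> int:
    """Seeded hash that is stable across runs (unlike built-in hash() on str)."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(record, num_hashes=4):
    """MinHash signature of one categorical record (assumed LSH family).

    Each (attribute index, value) pair becomes a set element; the signature
    keeps the minimum hash of the set under each of num_hashes seeded hashes.
    """
    tokens = [f"{i}={value}" for i, value in enumerate(record)]
    return tuple(
        min(_stable_hash(token, seed) for token in tokens)
        for seed in range(num_hashes)
    )


def lsh_partition(records, num_hashes=4):
    """Group record indices by signature; each bucket is one sub-dataset."""
    buckets = defaultdict(list)
    for index, record in enumerate(records):
        buckets[minhash_signature(record, num_hashes)].append(index)
    return dict(buckets)


records = [
    ("red", "small", "round"),
    ("red", "small", "round"),
    ("blue", "large", "square"),
]
partitions = lsh_partition(records)
```

In a MapReduce setting, the signature would plausibly play the role of the map output key, so that each reducer receives one bucket. Fewer hash functions per signature make collisions between similar (not only identical) records more likely, at the cost of coarser buckets.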
Pages: 542-555
Page count: 14
Related papers
50 in total
  • [21] A weighting k-modes algorithm for subspace clustering of categorical data
    Cao, Fuyuan
    Liang, Jiye
    Li, Deyu
    Zhao, Xingwang
    [J]. NEUROCOMPUTING, 2013, 108 : 23 - 30
  • [22] A Subspace Clustering Algorithm of Categorical Data Using Multiple Attribute Weights
    [J]. Zhang, Ji-Fu (jifuzh@sina.com), 2018, Science Press (44):
  • [23] A Hierarchical Clustering for Categorical Data Based on Holo-entropy
    Sun, Haojun
    Chen, Rongbo
    Jin, Shulin
    Qin, Yong
    [J]. 2015 12TH WEB INFORMATION SYSTEM AND APPLICATION CONFERENCE (WISA), 2015, : 269 - 274
  • [24] Comparison of internal evaluation criteria in hierarchical clustering of categorical data
    Sulc, Zdenek
    Hornicek, Jaroslav
    Rezankova, Hana
    Cibulkova, Jana
    [J]. ADVANCES IN DATA ANALYSIS AND CLASSIFICATION, 2024,
  • [25] Hierarchical density-based clustering of categorical data and a simplification
    Andreopoulos, Bill
    An, Aijun
    Wang, Xiaogang
    [J]. ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PROCEEDINGS, 2007, 4426 : 11 - +
  • [26] Holo-Entropy Based Categorical Data Hierarchical Clustering
    Sun, Haojun
    Chen, Rongbo
    Qin, Yong
    Wang, Shengrui
    [J]. INFORMATICA, 2017, 28 (02) : 303 - 328
  • [27] Parallel Hierarchical Agglomerative Clustering for fMRI Data
    Angeletti, Melodie
    Bonny, Jean-Marie
    Durif, Franck
    Koko, Jonas
    [J]. PARALLEL PROCESSING AND APPLIED MATHEMATICS (PPAM 2017), PT I, 2018, 10777 : 265 - 275
  • [28] A Parallel data preprocessing algorithm for hierarchical clustering
    Li Zhao-Peng
    Li Zhao-jian
    [J]. 2013 FIFTH INTERNATIONAL CONFERENCE ON MEASURING TECHNOLOGY AND MECHATRONICS AUTOMATION (ICMTMA 2013), 2013, : 70 - 73
  • [29] A Categorical Data Clustering Algorithm and Its Efficient Parallel Implementation
    Ding, Xiangwu
    Tan, Jia
    Wang, Mei
    [J]. PROCEEDINGS OF 2016 5TH INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND NETWORK TECHNOLOGY (ICCSNT), 2016, : 224 - 228
  • [30] A scalable parallel subspace clustering algorithm for massive data sets
    Nagesh, HS
    Goil, S
    Choudhary, A
    [J]. 2000 INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, PROCEEDINGS, 2000, : 477 - 484