Parallel Hierarchical Subspace Clustering of Categorical Data

Cited by: 12
Authors
Pang, Ning [1]
Zhang, Jifu [1]
Zhang, Chaowei [2]
Qin, Xiao [2]
Affiliations
[1] Taiyuan Univ Sci & Technol TYUST, Taiyuan 030024, Shanxi, Peoples R China
[2] Auburn Univ, Samuel Ginn Coll Engn, Dept Comp Sci & Software Engn, Auburn, AL 36849 USA
Funding
National Science Foundation (USA); National Natural Science Foundation of China
Keywords
Hierarchical subspace-clustering; LSH-based data partitioning; categorical data; Hadoop; MapReduce
DOI
10.1109/TC.2018.2879332
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Parallel clustering is an important research area of big data analysis. Conventional Hierarchical Agglomerative Clustering (HAC) techniques are inadequate for large-scale categorical datasets because of two drawbacks. First, HAC consumes excessive CPU time and memory resources; second, it is non-trivial to decompose clustering tasks into independent sub-tasks that can be executed in parallel. We solve these two problems with a MapReduce-based hierarchical subspace-clustering algorithm, called PAPU, that uses LSH-based data partitioning. PAPU partitions a large-scale dataset into multiple independent sub-datasets, into which similar data objects are mapped. Advocating parallel computing, PAPU obtains sub-clusters corresponding to respective attribute subspaces from independent chunks in the local clustering phase. To improve the accuracy of the approximated clustering results, PAPU measures clusters of various scales by applying the hierarchical clustering scheme to iteratively merge sub-clusters during the global clustering phase. We implement PAPU on a 24-node Hadoop computing platform. The experimental results reveal that hierarchical subspace-clustering coupled with the data-partitioning strategy achieves high clustering efficiency on both synthetic and real-world large-scale datasets. The experiments also demonstrate that PAPU delivers superior performance in terms of extensibility and scalability (e.g., a nearly linear speedup).
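The abstract does not say which LSH family PAPU uses, so the following is only an illustrative sketch of the general idea of LSH-based partitioning for categorical records, not the authors' method. It assumes MinHash over "attribute=value" tokens as the hash family; the function names (`minhash_signature`, `lsh_partition`) and the parameters `num_hashes` and `num_partitions` are hypothetical. In a MapReduce setting, the partition key would play the role of the map output key, so that each reducer could cluster one chunk independently.

```python
import hashlib
from collections import defaultdict


def _hash(token: str, seed: int) -> int:
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(record, num_hashes=4):
    """MinHash signature of a categorical record (assumed LSH family).

    Each attribute is encoded as an "index=value" token so that equal
    values in different attributes do not collide.
    """
    tokens = [f"{i}={v}" for i, v in enumerate(record)]
    return tuple(min(_hash(t, seed) for t in tokens) for seed in range(num_hashes))


def lsh_partition(records, num_hashes=4, num_partitions=3):
    """Group records into partitions keyed by their hashed signature.

    Records with identical signatures always land in the same partition;
    records sharing many attribute values are more likely to share a signature.
    """
    partitions = defaultdict(list)
    for rec in records:
        sig = minhash_signature(rec, num_hashes)
        key = _hash(repr(sig), seed=-1) % num_partitions
        partitions[key].append(rec)
    return dict(partitions)


if __name__ == "__main__":
    records = [
        ("red", "small", "round"),
        ("red", "small", "round"),
        ("blue", "large", "square"),
        ("red", "small", "oval"),
    ]
    for key, chunk in sorted(lsh_partition(records).items()):
        print(key, chunk)
```

Fewer hash functions per signature make collisions between similar (not just identical) records more likely, at the cost of coarser partitions; the right trade-off depends on the data and is not given in the abstract.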
Pages: 542-555
Page count: 14