Parallel Hierarchical Subspace Clustering of Categorical Data

Cited by: 12

Authors:

Pang, Ning [1]

Zhang, Jifu [1]

Zhang, Chaowei [2]

Qin, Xiao [2]

Affiliations:

[1] Taiyuan University of Science and Technology (TYUST), Taiyuan 030024, Shanxi, People's Republic of China

[2] Auburn University, Samuel Ginn College of Engineering, Department of Computer Science and Software Engineering, Auburn, AL 36849, USA

Source:

IEEE TRANSACTIONS ON COMPUTERS | 2019 / Vol. 68 / No. 04

Funding:

U.S. National Science Foundation; National Natural Science Foundation of China

Keywords:

Hierarchical subspace-clustering; LSH-based data partitioning; categorical data; Hadoop; MapReduce

DOI:

10.1109/TC.2018.2879332

Chinese Library Classification:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

Parallel clustering is an important research area in big data analysis. Conventional Hierarchical Agglomerative Clustering (HAC) techniques are inadequate for large-scale categorical datasets because of two drawbacks. First, HAC consumes excessive CPU time and memory; second, it is non-trivial to decompose clustering tasks into independent sub-tasks that can be executed in parallel. We address these two problems with PAPU, a MapReduce-based hierarchical subspace-clustering algorithm that uses LSH-based data partitioning. PAPU partitions a large-scale dataset into multiple independent sub-datasets, into which similar data objects are mapped. In the local clustering phase, PAPU obtains sub-clusters, each corresponding to its own attribute subspace, from the independent chunks in parallel. To improve the accuracy of the approximate clustering results, PAPU captures clusters of various scales by applying a hierarchical clustering scheme that iteratively merges sub-clusters during the global clustering phase. We implement PAPU on a 24-node Hadoop computing platform. The experimental results show that hierarchical subspace-clustering coupled with the data-partitioning strategy achieves high clustering efficiency on both synthetic and real-world large-scale datasets. The experiments also demonstrate that PAPU delivers superior extensibility and scalability (e.g., a nearly linear speedup).
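
As a rough illustration of the LSH-based data partitioning idea mentioned in the abstract, the following sketch groups categorical records into buckets so that similar records tend to share a bucket. It is not PAPU's actual procedure: the choice of MinHash as the hash family, the function names (`minhash_signature`, `lsh_partition`), and the parameters (`num_hashes`, `band_size`) are all assumptions made for this example, since the abstract does not specify them.

```python
import hashlib
from collections import defaultdict


def _hash(token: str, seed: int) -> int:
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.blake2b(f"{seed}:{token}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(record, num_hashes=4):
    """MinHash signature of a categorical record.

    The record is treated as a set of 'attribute_index=value' tokens
    (an assumption of this sketch, not taken from the paper).
    """
    tokens = [f"{i}={v}" for i, v in enumerate(record)]
    return tuple(min(_hash(t, seed) for t in tokens) for seed in range(num_hashes))


def lsh_partition(records, num_hashes=4, band_size=2):
    """Group record indices by the first band of their MinHash signature.

    Records with equal band values fall into the same bucket; each bucket
    could then be clustered independently of the others.
    """
    buckets = defaultdict(list)
    for idx, rec in enumerate(records):
        key = minhash_signature(rec, num_hashes)[:band_size]
        buckets[key].append(idx)
    return dict(buckets)


records = [
    ("red", "small", "round"),
    ("red", "small", "round"),
    ("blue", "large", "square"),
]
partitions = lsh_partition(records)
for key, members in partitions.items():
    print(key, members)
```

In a MapReduce setting, the bucket key would plausibly serve as the map output key, so that each reducer receives one bucket for local clustering; that mapping is likewise an assumption here rather than a detail stated in the abstract.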

Pages: 542-555

Page count: 14
