A scalable parallel subspace clustering algorithm for massive data sets

Cited by: 28
Authors
Nagesh, HS [1]
Goil, S [1]
Choudhary, A [1]
Affiliation
[1] Northwestern Univ, ECE Dept, Evanston, IL 60208 USA
DOI
10.1109/ICPP.2000.876164
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Clustering is a data mining problem that finds dense regions in a sparse multi-dimensional data set. The attribute values and ranges of these regions characterize the clusters. Clustering algorithms need to scale with the database size and also with the large dimensionality of the data set. Further, these algorithms need to explore the embedded clusters in a subspace of a high-dimensional space. However, the time complexity of the algorithm to explore clusters in subspaces is exponential in the dimensionality of the data and is thus extremely compute intensive. Thus, parallelization is the choice for discovering clusters for large data sets. In this paper we present a scalable parallel subspace clustering algorithm which has both data and task parallelism embedded in it. We also formulate the technique of adaptive grids and present a truly unsupervised clustering algorithm requiring no user inputs. Our implementation shows near-linear speedups with negligible communication overheads. The use of adaptive grids results in two orders of magnitude improvement in the computation time of our serial algorithm over current methods with much better quality of clustering. Performance results on both real and synthetic data sets with a very large number of dimensions on a 16-node IBM SP2 demonstrate our algorithm to be a practical and scalable clustering technique.
Pages: 477-484
Number of pages: 8
Related papers
50 in total
  • [1] PBIRCH: A scalable parallel clustering algorithm for incremental data
    Garg, Ashwani
    Mangla, Ashish
    Gupta, Neelima
    Bhatnagar, Vasudha
    [J]. 10TH INTERNATIONAL DATABASE ENGINEERING AND APPLICATIONS SYMPOSIUM, PROCEEDINGS, 2006, : 315 - +
  • [2] PARALLEL HOP: A SCALABLE HALO FINDER FOR MASSIVE COSMOLOGICAL DATA SETS
    Skory, Stephen
    Turk, Matthew J.
    Norman, Michael L.
    Coil, Alison L.
    [J]. ASTROPHYSICAL JOURNAL SUPPLEMENT SERIES, 2010, 191 (01): : 43 - 57
  • [3] A novel algorithm for fast and scalable subspace clustering of high-dimensional data
    Kaur A.
    Datta A.
    [J]. Journal of Big Data, 2015, 2 (01)
  • [4] Parallel Clustering Algorithm for Large Data Sets with Applications in Bioinformatics
    Olman, Victor
    Mao, Fenglou
    Wu, Hongwei
    Xu, Ying
    [J]. IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 2009, 6 (02) : 344 - 352
  • [5] Scalable Bootstrap Clustering for Massive Data
    Wang, Haocheng
    Zhuang, Fuzhen
    Ao, Xiang
    He, Qing
    Shi, Zhongzhi
    [J]. 2014 15TH IEEE/ACIS INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ARTIFICIAL INTELLIGENCE, NETWORKING AND PARALLEL/DISTRIBUTED COMPUTING (SNPD), 2014, : 123 - 128
  • [6] P-AutoClass: Scalable parallel clustering for mining large data sets
    Pizzuti, C
    Talia, D
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2003, 15 (03) : 629 - 641
  • [7] CLUS: Parallel Subspace Clustering Algorithm on Spark
    Zhu, Bo
    Mara, Alexandru
    Mozo, Alberto
    [J]. NEW TRENDS IN DATABASES AND INFORMATION SYSTEMS (ADBIS 2015), 2015, 539 : 175 - 185
  • [8] A Scalable Exemplar-Based Subspace Clustering Algorithm for Class-Imbalanced Data
    You, Chong
    Li, Chi
    Robinson, Daniel P.
    Vidal, Rene
    [J]. COMPUTER VISION - ECCV 2018, PT IX, 2018, 11213 : 68 - 85
  • [9] Patch clustering for massive data sets
    Alex, Nikolai
    Hasenfuss, Alexander
    Hammer, Barbara
    [J]. NEUROCOMPUTING, 2009, 72 (7-9) : 1455 - 1469
  • [10] Parallel Clustering Algorithm for Large-Scale Biological Data Sets
    Wang, Minchao
    Zhang, Wu
    Ding, Wang
    Dai, Dongbo
    Zhang, Huiran
    Xie, Hao
    Chen, Luonan
    Guo, Yike
    Xie, Jiang
    [J]. PLOS ONE, 2014, 9 (04):