Local-Density Subspace Distributed Clustering for High-Dimensional Data

Cited by: 7
Authors
Geng, Yangli-ao [1]
Li, Qingyong [1]
Liang, Mingfei [2]
Chi, Chong-Yung [3]
Tan, Juan [4]
Huang, Heng [5, 6]
Affiliations
[1] Beijing Jiaotong Univ, Beijing Key Lab Transportat Data Anal & Min, Beijing 100044, Peoples R China
[2] Tencent Co Ltd, WeiXin Grp, Beijing 100044, Peoples R China
[3] Natl Tsing Hua Univ, Inst Commun Engn, Hsinchu 30013, Taiwan
[4] Beijing Technol & Business Univ, Dept Business Adm, Beijing 100048, Peoples R China
[5] Univ Pittsburgh, Dept Elect & Comp Engn, Pittsburgh, PA 15260 USA
[6] JD Finance Amer Corp, Mountain View, CA USA
Funding
Beijing Natural Science Foundation
Keywords
Clustering algorithms; Distributed databases; Principal component analysis; Data models; Clustering methods; Big Data; Kernel; High-dimensional clustering; distributed clustering; density-based clustering; subspace Gaussian model; ALGORITHM
DOI
10.1109/TPDS.2020.2975550
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Distributed clustering is emerging with the advent of the big data era. However, most established distributed clustering methods focus on problems caused by the large volume of data rather than by its high dimensionality. Consequently, they suffer from the "curse" of dimensionality (e.g., poor performance and heavy network overhead) when high-dimensional (HD) data are clustered. In this article, we propose a distributed algorithm, referred to as the Local Density Subspace Distributed Clustering (LDSDC) algorithm, to cluster large-scale HD data, motivated by the idea that a local dense region of an HD dataset is usually distributed in a low-dimensional (LD) subspace. LDSDC follows a local-global-local processing structure: grouping of local dense regions (atom clusters) followed by subspace Gaussian model (SGM) fitting (flexible and scalable to data dimension) at each sub-site, computation of a merging result at the global site, and merging of atom clusters at every sub-site according to the merging result broadcast from the global site. Moreover, we propose a fast method to estimate the parameters of the SGM for HD data, together with its convergence proof. We evaluate LDSDC on both synthetic and real datasets and compare it with four state-of-the-art methods. The experimental results demonstrate that the proposed LDSDC yields the best overall performance.
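The local-global-local structure described in the abstract can be illustrated with a small, hedged sketch. This is not the LDSDC algorithm: the greedy grouping rule, the use of plain centroids as the summary sent to the global site (in place of the paper's SGM parameters), the distance-threshold merge, and every function and parameter name below are assumptions made purely for illustration.

```python
from math import dist


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def group_local_atoms(points, radius):
    """Local step (assumed stand-in): greedily group nearby points into atom clusters."""
    atoms = []
    for p in points:
        for atom in atoms:
            if dist(p, centroid(atom)) <= radius:
                atom.append(p)
                break
        else:
            atoms.append([p])
    return atoms


def merge_at_global_site(summaries, merge_radius):
    """Global step (assumed stand-in): union atoms whose summary centroids are close.

    summaries maps (site_id, atom_id) -> centroid. Returns a dict mapping
    each key to a global cluster label; this dict is what gets broadcast.
    """
    keys = sorted(summaries)
    parent = {k: k for k in keys}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            if dist(summaries[a], summaries[b]) <= merge_radius:
                parent[find(b)] = find(a)

    roots = {}
    return {k: roots.setdefault(find(k), len(roots)) for k in keys}


def local_global_local(sites, radius, merge_radius):
    """Run local -> global -> local; returns {site_id: [(point, label), ...]}."""
    local_atoms = {s: group_local_atoms(pts, radius) for s, pts in sites.items()}
    # Only compact per-atom summaries travel to the global site, not raw points.
    summaries = {
        (s, i): centroid(atom)
        for s, atoms in local_atoms.items()
        for i, atom in enumerate(atoms)
    }
    broadcast = merge_at_global_site(summaries, merge_radius)
    # Final local step: each sub-site relabels its atoms from the broadcast.
    return {
        s: [(p, broadcast[(s, i)]) for i, atom in enumerate(atoms) for p in atom]
        for s, atoms in local_atoms.items()
    }


# Hypothetical toy data: two sub-sites that each see parts of the same two groups.
sites = {
    "A": [(0.0, 0.0), (0.2, 0.1), (10.0, 10.0)],
    "B": [(0.1, 0.3), (9.8, 10.1), (10.2, 9.9)],
}
labels = local_global_local(sites, radius=1.0, merge_radius=1.0)
```

The point of the sketch is the communication pattern: sub-sites send one small summary per atom cluster and receive back only a label mapping, so network traffic depends on the number of atom clusters rather than on the number of raw points.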
Pages: 1799-1814
Number of pages: 16
Related papers
50 in total
  • [31] Clustering High-Dimensional Data: A Survey on Subspace Clustering, Pattern-Based Clustering, and Correlation Clustering
    Kriegel, Hans-Peter
    Kroeger, Peer
    Zimek, Arthur
    ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA, 2009, 3 (01)
  • [32] A grid-based subspace clustering algorithm for high-dimensional data streams
    Sun, Yufen
    Lu, Yansheng
    WEB INFORMATION SYSTEMS - WISE 2006 WORKSHOPS, PROCEEDINGS, 2006, 4256 : 37 - 48
  • [33] ASCRClu: an adaptive subspace combination and reduction algorithm for clustering of high-dimensional data
    Kavan Fatehi
    Mohsen Rezvani
    Mansoor Fateh
    Pattern Analysis and Applications, 2020, 23 : 1651 - 1663
  • [34] Fast Adaptive K-Means Subspace Clustering for High-Dimensional Data
    Wang, Xiao-Dong
    Chen, Rung-Ching
    Yan, Fei
    Zeng, Zhi-Qiang
    Hong, Chao-Qun
    IEEE ACCESS, 2019, 7 : 42639 - 42651
  • [35] ASCRClu: an adaptive subspace combination and reduction algorithm for clustering of high-dimensional data
    Fatehi, Kavan
    Rezvani, Mohsen
    Fateh, Mansoor
    PATTERN ANALYSIS AND APPLICATIONS, 2020, 23 (04) : 1651 - 1663
  • [36] High-dimensional data clustering
    Bouveyron, C.
    Girard, S.
    Schmid, C.
    COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2007, 52 (01) : 502 - 519
  • [37] Clustering High-Dimensional Data
    Masulli, Francesco
    Rovetta, Stefano
    CLUSTERING HIGH-DIMENSIONAL DATA, CHDD 2012, 2015, 7627 : 1 - 13
  • [38] Subspace clustering of high dimensional data
    Domeniconi, C
    Papadopoulos, D
    Gunopulos, D
    Ma, S
    Proceedings of the Fourth SIAM International Conference on Data Mining, 2004, : 517 - 521
  • [39] A Hybrid EA for High-dimensional Subspace Clustering Problem
    Lin, Lin
    Gen, Mitsuo
    Liang, Yan
    2014 IEEE CONGRESS ON EVOLUTIONARY COMPUTATION (CEC), 2014, : 2855 - 2860
  • [40] Clustering large spatial data with local-density and its application
    Wei, Guiyi
    Liu, Haiping
    Xie, Mande
    Information Technology Journal, 2009, 8 (04) : 476 - 485