Local-Density Subspace Distributed Clustering for High-Dimensional Data

Cited by: 7
Authors
Geng, Yangli-ao [1]
Li, Qingyong [1]
Liang, Mingfei [2]
Chi, Chong-Yung [3]
Tan, Juan [4]
Huang, Heng [5, 6]
Affiliations
[1] Beijing Jiaotong Univ, Beijing Key Lab Transportat Data Anal & Min, Beijing 100044, Peoples R China
[2] Tencent Co Ltd, WeiXin Grp, Beijing 100044, Peoples R China
[3] Natl Tsing Hua Univ, Inst Commun Engn, Hsinchu 30013, Taiwan
[4] Beijing Technol & Business Univ, Dept Business Adm, Beijing 100048, Peoples R China
[5] Univ Pittsburgh, Dept Elect & Comp Engn, Pittsburgh, PA 15260 USA
[6] JD Finance Amer Corp, Mountain View, CA USA
Funding
Beijing Natural Science Foundation;
Keywords
Clustering algorithms; Distributed databases; Principal component analysis; Data models; Clustering methods; Big Data; Kernel; High-dimensional clustering; distributed clustering; density-based clustering; subspace Gaussian model; ALGORITHM;
DOI
10.1109/TPDS.2020.2975550
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Distributed clustering has emerged with the advent of the big data era. However, most established distributed clustering methods address problems caused by the large volume of data rather than by its high dimensionality. Consequently, they suffer from the "curse" of dimensionality (e.g., poor performance and heavy network overhead) when clustering high-dimensional (HD) data. In this article, we propose a distributed algorithm, referred to as the Local Density Subspace Distributed Clustering (LDSDC) algorithm, to cluster large-scale HD data, motivated by the idea that a local dense region of an HD dataset is usually distributed in a low-dimensional (LD) subspace. LDSDC follows a local-global-local processing structure: grouping of local dense regions (atom clusters) followed by subspace Gaussian model (SGM) fitting (flexible and scalable to data dimension) at each sub-site, merging of atom clusters at the global site, and relabeling of atom clusters at every sub-site according to the merging result broadcast from the global site. Moreover, we propose a fast method to estimate the parameters of the SGM for HD data, together with its convergence proof. We evaluate LDSDC on both synthetic and real datasets and compare it with four state-of-the-art methods. The experimental results demonstrate that the proposed LDSDC yields the best overall performance.
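The local-global-local flow described in the abstract can be illustrated with a toy sketch. This is not the LDSDC algorithm itself: grid-cell grouping stands in for the grouping of local dense regions, a plain centroid stands in for the fitted SGM, and a centroid-distance threshold stands in for the paper's merging criterion. All function names and the `cell` and `radius` parameters are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def local_atoms(points, cell=1.0):
    """Local step (stand-in): group points by grid cell and summarize each group."""
    groups = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(math.floor(x / cell) for x in p)
        groups[key].append(idx)
    dim = len(points[0])
    atoms = []
    for atom_id, members in enumerate(groups.values()):
        centroid = tuple(
            sum(points[i][d] for i in members) / len(members) for d in range(dim)
        )
        atoms.append({"atom": atom_id, "members": members, "centroid": centroid})
    return atoms


def global_merge(summaries, radius=1.5):
    """Global step (stand-in): union atoms whose centroids lie within `radius`."""
    parent = list(range(len(summaries)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(summaries)):
        for j in range(i + 1, len(summaries)):
            if math.dist(summaries[i][2], summaries[j][2]) <= radius:
                parent[find(j)] = find(i)

    roots, mapping = {}, {}
    for i, (site, atom, _) in enumerate(summaries):
        mapping[(site, atom)] = roots.setdefault(find(i), len(roots))
    return mapping  # this mapping is what would be broadcast to the sub-sites


def run(sites, cell=1.0, radius=1.5):
    # Local: every sub-site builds its atom clusters; only summaries leave the site.
    local = {s: local_atoms(pts, cell) for s, pts in sites.items()}
    summaries = [
        (s, a["atom"], a["centroid"]) for s, atoms in local.items() for a in atoms
    ]
    # Global: merge atom summaries collected from all sub-sites.
    mapping = global_merge(summaries, radius)
    # Local: every sub-site applies the broadcast merging result to its points.
    labels = {}
    for s, atoms in local.items():
        site_labels = [None] * len(sites[s])
        for a in atoms:
            for i in a["members"]:
                site_labels[i] = mapping[(s, a["atom"])]
        labels[s] = site_labels
    return labels


sites = {
    "A": [(0.1, 0.2), (0.4, 0.3), (10.2, 10.1)],
    "B": [(0.6, 0.5), (10.4, 10.3), (10.6, 10.8)],
}
labels = run(sites)
print(labels)
```

In this sketch only the atom summaries cross the network, never the raw points, which mirrors the abstract's concern about network overhead; a real summary for HD data would need to be far more compact than a full-dimensional centroid.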
Pages: 1799 - 1814
Page count: 16
Related papers
50 in total
  • [1] Density Conscious Subspace Clustering for High-Dimensional Data
    Chu, Yi-Hong
    Huang, Jen-Wei
    Chuang, Kun-Ta
    Yang, De-Nian
    Chen, Ming-Syan
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2010, 22 (01) : 16 - 30
  • [2] Density-connected subspace clustering for high-dimensional data
    Kailing, K
    Kriegel, HP
    Kröger, P
    PROCEEDINGS OF THE FOURTH SIAM INTERNATIONAL CONFERENCE ON DATA MINING, 2004, : 246 - 256
  • [3] Accelerating Density-Based Subspace Clustering in High-Dimensional Data
    Prinzbach, Juergen
    Lauer, Tobias
    Kiefer, Nicolas
    21ST IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS ICDMW 2021, 2021, : 474 - 481
  • [4] Subspace selection for clustering high-dimensional data
    Baumgartner, C
    Plant, C
    Kailing, K
    Kriegel, HP
    Kröger, P
    FOURTH IEEE INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS, 2004, : 11 - 18
  • [5] Evolutionary Subspace Clustering Algorithm for High-Dimensional Data
    Nourashrafeddin, S. N.
    Arnold, Dirk V.
    Milios, Evangelos
    PROCEEDINGS OF THE FOURTEENTH INTERNATIONAL CONFERENCE ON GENETIC AND EVOLUTIONARY COMPUTATION COMPANION (GECCO'12), 2012, : 1497 - 1498
  • [6] Subspace clustering of high-dimensional data: a predictive approach
    Brian McWilliams
    Giovanni Montana
    Data Mining and Knowledge Discovery, 2014, 28 : 736 - 772
  • [7] Subspace Clustering of High-Dimensional Data: An Evolutionary Approach
    Vijendra, Singh
    Laxman, Sahoo
    APPLIED COMPUTATIONAL INTELLIGENCE AND SOFT COMPUTING, 2013, 2013
  • [8] Subspace clustering of high-dimensional data: a predictive approach
    McWilliams, Brian
    Montana, Giovanni
    DATA MINING AND KNOWLEDGE DISCOVERY, 2014, 28 (03) : 736 - 772
  • [9] Subspace Clustering of Very Sparse High-Dimensional Data
    Peng, Hankui
    Pavlidis, Nicos
    Eckley, Idris
    Tsalamanis, Ioannis
    2018 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2018, : 3780 - 3783
  • [10] Local gap density for clustering high-dimensional data with varying densities
    Li, Ruijia
    Yang, Xiaofei
    Qin, Xiaolong
    Zhu, William
    KNOWLEDGE-BASED SYSTEMS, 2019, 184