HCDC: A novel hierarchical clustering algorithm based on density-distance cores for data sets with varying density

Cited by: 11
Authors
Yang, Qi-Fen [1]
Gao, Wan-Yi [2]
Han, Gang [3]
Li, Zi-Yang [4]
Tian, Meng [5]
Zhu, Shu-Hua [1]
Deng, Yu-hui [1]
Affiliations
[1] Jinan Univ, Coll Informat Sci & Technol, Guangzhou 510000, Peoples R China
[2] Jinan Univ, Sch Econ, Guangzhou 510000, Peoples R China
[3] Jinan Univ, Grad Sch, Guangzhou 510632, Peoples R China
[4] Northeast Agr Univ, Coll Art, Harbin 150030, Heilongjiang, Peoples R China
[5] Jingzhou Vocat & Tech Coll, Sch Informat & Commun Engn, Jingzhou 434020, Peoples R China
Keywords
Density-distance representative; Data mining; Hierarchical clustering; Natural neighbor; Varying density; NEIGHBOR; SEARCH; ROBUST;
DOI
10.1016/j.is.2022.102159
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Cluster analysis is a crucial data mining technique widely used in image segmentation, language processing, and pattern recognition. Most existing clustering algorithms cannot identify complex shapes in manifold data sets or in data sets with varying density, especially when clusters with significantly different densities lie close to each other. Hierarchical clustering algorithms can identify clusters of arbitrary shape, but they handle significant density variations poorly and have a high time cost. In this paper, we propose a novel hierarchical clustering algorithm based on density-distance cores, called HCDC. It first selects the density-distance representative points for each point from the set of candidate representative points. It then selects density-distance cores from all density-distance representatives. Finally, it replaces the whole data set with the density-distance cores and applies hierarchical clustering to them using a new distance. To keep noise points from influencing the search for density-distance cores, we also propose a noise point detection method and verify its feasibility. We compare the proposed algorithm with existing classical and recent algorithms on synthetic and real data sets. Experiments show that it clusters better than existing algorithms on data sets with complex shapes and on data sets with varying densities. On data sets where sparse and dense clusters lie close to each other, the ARI score of HCDC is more than 0.1 higher than that of LDP-MST. In particular, on the grid data set, HCDC's ARI score is 0.997 higher than that of LDP-MST. On DS3 and DS8, HCDC's ARI score is more than 0.14 higher than that of the second-best algorithm, RNN-DBSCAN. Moreover, on the zoo data set, HCDC's ARI score is 0.15 and 0.6 higher than those of RNN-DBSCAN and LDP-MST, respectively. On the Olivetti face data set, HCDC is the only algorithm with an NMI score above 0.9 on the photo1 and photo2 data sets. © 2022 Elsevier Ltd. All rights reserved.
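The three stages named in the abstract (pick a representative for each point, reduce the representatives to cores, cluster only the cores hierarchically) can be illustrated with a small sketch. This is not the HCDC algorithm itself: the abstract does not give the definitions, so everything below is an assumption made for illustration. Density is taken as the inverse mean distance to the k nearest neighbours, a point's representative is the densest point among itself and those neighbours, cores are the fixed points of the representative links, and plain Euclidean single linkage stands in for the paper's new distance. Noise detection is omitted, and the function names and the parameter `k` are hypothetical.

```python
import math


def find_cores(points, k=3):
    """Map every point to a core index (toy stand-in, not the paper's definition)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # k nearest neighbours of each point, excluding the point itself
    neighbors = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # Assumed density: inverse of the mean distance to the k nearest neighbours
    density = [k / (sum(dist[i][j] for j in neighbors[i]) + 1e-12) for i in range(n)]
    # Assumed representative: densest point among the point and its neighbours.
    # Listing i first means ties keep the point itself, so links never cycle.
    rep = [max([i] + neighbors[i], key=lambda j: density[j]) for i in range(n)]

    def follow(i):
        # Walk the representative links until a point represents itself
        while rep[i] != i:
            i = rep[i]
        return i

    return [follow(i) for i in range(n)]


def single_linkage(points, ids, n_clusters):
    """Agglomerative single-linkage clustering over the given point indices."""
    clusters = [[i] for i in ids]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair of clusters
        clusters[a].extend(clusters.pop(b))
    return clusters


def cluster_via_cores(points, n_clusters, k=3):
    """Cluster only the cores, then give each point its core's label."""
    core_of = find_cores(points, k)
    cores = sorted(set(core_of))
    clusters = single_linkage(points, cores, n_clusters)
    label_of_core = {c: lab for lab, cl in enumerate(clusters) for c in cl}
    return [label_of_core[c] for c in core_of]


# A dense blob and a sparse blob, far apart
dense = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05)]
sparse = [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0), (10.5, 10.5)]
labels = cluster_via_cores(dense + sparse, n_clusters=2, k=3)
```

In this toy example each blob collapses to a single core (its centre point), so the hierarchical step only has to compare two points instead of ten. Clustering a small set of cores rather than the whole data set is the cost reduction the abstract describes.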
Pages: 14