Empirical comparison of fast partitioning-based clustering algorithms for large data sets

Cited by: 35
Authors
Wei, CP [1]
Lee, YH [1]
Hsu, CM [1]
Affiliation
[1] Natl Sun Yat Sen Univ, Coll Management, Dept Informat Management, Kaohsiung 80424, Taiwan
Keywords
data mining; clustering analysis; clustering algorithm comparison
DOI
10.1016/S0957-4174(02)00185-9
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Several fast algorithms for clustering very large data sets have been proposed in the literature, including CLARA, CLARANS, GAC-R-3, and GAC-RAR(w). CLARA is a combination of a sampling procedure and the classical PAM algorithm, while CLARANS adopts a serial randomized search strategy to find the optimal set of medoids. GAC-R-3 and GAC-RAR(w) exploit genetic search heuristics for solving clustering problems. In this research, we conducted an empirical comparison of these four clustering algorithms over a wide range of data characteristics described by data size, number of clusters, cluster distinctness, cluster asymmetry, and data randomness. According to the experimental results, CLARANS outperforms its counterparts both in clustering quality and execution time when the number of clusters increases, clusters are more closely related, more asymmetric clusters are present, or more random objects exist in the data set. With a specific number of clusters, CLARA can efficiently achieve satisfactory clustering quality when the data size is larger, whereas GAC-R-3 and GAC-RAR(w) can achieve satisfactory clustering quality and efficiency when the data size is small, the number of clusters is small, and clusters are more distinct and symmetric. © 2003 Elsevier Science Ltd. All rights reserved.
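The abstract's description of CLARA as "a sampling procedure plus PAM" can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`total_cost`, `pam`, `clara`), the greedy swap loop standing in for full PAM, and the default sample size of 40 + 2k (a commonly cited convention, assumed here) are all illustrative choices.

```python
import random

def total_cost(points, medoids, dist):
    # Sum of each point's distance to its nearest medoid.
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k, dist, rng):
    # Simplified PAM (assumption: random start, accept any improving swap).
    medoids = rng.sample(points, k)
    best = total_cost(points, medoids, dist)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in points:
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                cost = total_cost(points, trial, dist)
                if cost < best:
                    medoids, best = trial, cost
                    improved = True
    return medoids

def clara(points, k, dist, n_samples=5, sample_size=None, seed=0):
    # Run PAM on several random samples; keep the medoid set that
    # gives the lowest cost when evaluated on the FULL data set.
    rng = random.Random(seed)
    if sample_size is None:
        sample_size = min(len(points), 40 + 2 * k)  # assumed default
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, sample_size)
        medoids = pam(sample, k, dist, rng)
        cost = total_cost(points, medoids, dist)
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost
```

The point of the design is that PAM's expensive swap search only ever sees a small sample, while the quality of each candidate medoid set is still judged against all objects. CLARANS, by contrast, searches over the full data set but examines only a random subset of possible swaps at each step.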
Pages: 351-363
Page count: 13