Empirical comparison of fast partitioning-based clustering algorithms for large data sets

Cited by: 35
Authors
Wei, CP [1]
Lee, YH [1]
Hsu, CM [1]
Affiliation
[1] Natl Sun Yat Sen Univ, Coll Management, Dept Informat Management, Kaohsiung 80424, Taiwan
Keywords
data mining; clustering analysis; clustering algorithm comparison
DOI
10.1016/S0957-4174(02)00185-9
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Several fast algorithms for clustering very large data sets have been proposed in the literature, including CLARA, CLARANS, GAC-R-3, and GAC-RAR(w). CLARA is a combination of a sampling procedure and the classical PAM algorithm, while CLARANS adopts a serial randomized search strategy to find the optimal set of medoids. GAC-R-3 and GAC-RAR(w) exploit genetic search heuristics for solving clustering problems. In this research, we conducted an empirical comparison of these four clustering algorithms over a wide range of data characteristics described by data size, number of clusters, cluster distinctness, cluster asymmetry, and data randomness. According to the experimental results, CLARANS outperforms its counterparts both in clustering quality and execution time when the number of clusters increases, clusters are more closely related, more asymmetric clusters are present, or more random objects exist in the data set. With a specific number of clusters, CLARA can efficiently achieve satisfactory clustering quality when the data size is larger, whereas GAC-R-3 and GAC-RAR(w) can achieve satisfactory clustering quality and efficiency when the data size is small, the number of clusters is small, and clusters are more distinct and symmetric. (C) 2003 Elsevier Science Ltd. All rights reserved.
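The abstract describes CLARANS as a serial randomized search over sets of medoids. A minimal sketch of that idea follows, assuming Euclidean distance and a neighbor defined as a medoid set that differs in exactly one medoid. The function names and the parameters `num_local`, `max_neighbor`, and `seed` are illustrative assumptions, not taken from this record, and the sketch is not the implementation evaluated by the authors.

```python
import math
import random


def total_cost(points, medoid_idx):
    # Sum over all points of the distance to the nearest medoid.
    return sum(min(math.dist(p, points[m]) for m in medoid_idx) for p in points)


def clarans_sketch(points, k, num_local=4, max_neighbor=50, seed=0):
    # Assumes len(points) > k so that a non-medoid candidate always exists.
    rng = random.Random(seed)
    best, best_cost = None, math.inf
    for _ in range(num_local):
        # Each restart begins from a random set of k medoid indices.
        current = rng.sample(range(len(points)), k)
        current_cost = total_cost(points, current)
        tries = 0
        while tries < max_neighbor:
            # Build a random neighbor: swap one medoid for one non-medoid.
            out_pos = rng.randrange(k)
            candidate = rng.choice(
                [i for i in range(len(points)) if i not in current]
            )
            neighbor = current[:out_pos] + [candidate] + current[out_pos + 1:]
            neighbor_cost = total_cost(points, neighbor)
            if neighbor_cost < current_cost:
                # Move to the better neighbor and reset the failure counter.
                current, current_cost, tries = neighbor, neighbor_cost, 0
            else:
                tries += 1
        # Keep the best local minimum seen across restarts.
        if current_cost < best_cost:
            best, best_cost = sorted(current), current_cost
    return best, best_cost


# Two small, well-separated groups of 2-D points (made-up toy data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, cost = clarans_sketch(points, k=2)
```

CLARA, by contrast, would run PAM on random samples of the data and keep the medoid set that scores best on the full data set; that variant is not shown here.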
Pages: 351-363
Number of pages: 13