A Genetic Algorithm Approach for Clustering Large Data Sets

Authors
Luchi, Diego [1]
Rodrigues, Alexandre [1]
Varejao, Flavio Miguel [1]
Santos, Willian [1]
Affiliations
[1] Fed Univ State Espirito Santo, Vitoria, ES, Brazil
DOI
10.1109/ICTAI.2016.90
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper we present a sampling approach for running the k-means algorithm on large data sets. We propose a genetic algorithm that guides the sampling by evaluating the fitness of each individual of the population through the k-means clustering algorithm. Although the goal is a partition with the lowest sum of squared errors (SSE), the algorithm searches for the sample with the highest SSE. Once a good sample is found, the remaining points of the data set are assigned to the nearest centroid, and the SSE of the final solution is then calculated. The proposal is applied to a set of public-domain data sets, and the results are compared against two other methods: k-means run on a uniform random sample of the data set, and k-means run on the complete data set. The results show that the algorithm offers a good trade-off between quality and computational cost, especially for large data sets and higher numbers of clusters.
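The procedure outlined in the abstract can be sketched roughly as follows. This is a minimal illustration written only from the abstract, not the authors' implementation: the function names, the index-list encoding of a sample, the truncation selection, the union-based crossover, the single-index mutation, and every default parameter value are assumptions made for the example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))


def sse(points, centroids):
    """Sum of squared errors when each point goes to its nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means with random initial centroids."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


def ga_sample_kmeans(data, k, sample_size, pop_size=8, generations=10, seed=0):
    """Search for a high-SSE sample, then score its centroids on all the data.

    Assumed encoding: an individual is a list of distinct indices into data.
    pop_size must be at least 4 so that two parents can be drawn.
    """
    rng = random.Random(seed)
    n = len(data)
    population = [rng.sample(range(n), sample_size) for _ in range(pop_size)]
    best_fit, best_centroids = -1.0, None
    for gen in range(generations):
        scored = []
        for ind in population:
            sample = [data[i] for i in ind]
            cents = kmeans(sample, k, rng)
            fit = sse(sample, cents)  # fitness: SSE of k-means on the sample
            scored.append((fit, ind))
            if fit > best_fit:  # the search maximizes the sample SSE
                best_fit, best_centroids = fit, cents
        if gen == generations - 1:
            break
        # Assumed selection: keep the better half of the population.
        scored.sort(key=lambda t: t[0], reverse=True)
        parents = [ind for _, ind in scored[: pop_size // 2]]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Assumed crossover: draw the child from the parents' union.
            child = rng.sample(list(set(a) | set(b)), sample_size)
            # Assumed mutation: replace one index with a random unused one.
            new = rng.randrange(n)
            if new not in child:
                child[rng.randrange(sample_size)] = new
            children.append(child)
        population = parents + children
    # Final step: every point is assigned to its nearest centroid and the
    # SSE of that partition over the entire data set is reported.
    return best_centroids, sse(data, best_centroids)
```

Calling `ga_sample_kmeans(data, k=2, sample_size=20)` on a list of coordinate tuples returns the centroids learned from the selected sample together with the SSE over the whole data set. In this sketch k-means only ever runs on `sample_size` points, which is where the cost saving relative to clustering the complete data set would come from.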
Pages: 570-576
Number of pages: 7