A Genetic Algorithm Approach for Clustering Large Data Sets

Cited by: 0
Authors
Luchi, Diego [1]
Rodrigues, Alexandre [1]
Varejao, Flavio Miguel [1]
Santos, Willian [1]
Affiliation
[1] Fed Univ State Espirito Santo, Vitoria, ES, Brazil
DOI
10.1109/ICTAI.2016.90
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper we present a sampling approach for running the k-means algorithm on large data sets. We propose a genetic algorithm that guides the sampling by evaluating the fitness of each individual in the population with the k-means clustering algorithm. Although we want a partition with the lowest sum of squared errors (SSE), our algorithm searches for the sample with the highest SSE. After a good sample is found, the remaining points of the data set are assigned to their nearest centroids, and the SSE of the final solution is then calculated. Our proposal is applied to a set of public-domain data sets, and the results are compared against two other methods: k-means run on a uniform random sample of the data set, and k-means run on the complete data set. The results show that our algorithm offers a good trade-off between quality and computational cost, especially for large data sets and larger numbers of clusters.
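The procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the encoding of individuals, the selection scheme, the crossover and mutation operators, or any parameter values, so everything below (index-list individuals, truncation selection, union-based crossover, single-index mutation, and the names `kmeans`, `ga_sample_kmeans`, `pop_size`, `generations`) is an assumption made for illustration. Only the overall shape follows the abstract: fitness is the k-means SSE of a sample, the search favors samples with high SSE, and the final SSE is computed over the whole data set using nearest-centroid assignment.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))

def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means on a list of tuples; returns (centroids, sse)."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its group;
        # an empty group keeps its previous centroid.
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
    sse = sum(sq_dist(p, centroids[nearest(p, centroids)]) for p in points)
    return centroids, sse

def ga_sample_kmeans(data, k, sample_size, rng, pop_size=8, generations=10):
    """Hypothetical GA over samples; fitness is the k-means SSE of the sample."""
    n = len(data)

    def fitness(individual):
        # An individual is a list of distinct indices into data (assumed encoding).
        return kmeans([data[i] for i in individual], k, rng)[1]

    population = [rng.sample(range(n), sample_size) for _ in range(pop_size)]
    for _ in range(generations):
        # Truncation selection (assumed): keep the half with the HIGHEST SSE.
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Crossover (assumed): draw the child from the union of both parents.
            child = rng.sample(sorted(set(a) | set(b)), sample_size)
            # Mutation (assumed): swap in one random index if it is not a duplicate.
            new_index = rng.randrange(n)
            if new_index not in child:
                child[rng.randrange(sample_size)] = new_index
            children.append(child)
        population = parents + children

    best = max(population, key=fitness)
    centroids, _ = kmeans([data[i] for i in best], k, rng)
    # Assign every point of the full data set to its nearest centroid
    # and compute the SSE of that final partition.
    final_sse = sum(sq_dist(p, centroids[nearest(p, centroids)]) for p in data)
    return centroids, final_sse
```

Calling `ga_sample_kmeans(data, k=2, sample_size=20, rng=random.Random(0))` on a list of coordinate tuples returns the centroids and the SSE over all points. Because k-means starts from random centroids, the fitness of a sample is itself noisy in this sketch; a more careful version might average several runs or cache fitness values.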
Pages: 570-576
Number of pages: 7
Related papers
50 in total
  • [21] Advanced K-Means Clustering Algorithm for Large ECG Data Sets Based on K-SVD Approach
    Balouchestani, Mohammadreza
    Sugavaneswaran, Lakshmi
    Krishnan, Sridhar
    [J]. 2014 9TH INTERNATIONAL SYMPOSIUM ON COMMUNICATION SYSTEMS, NETWORKS & DIGITAL SIGNAL PROCESSING (CSNDSP), 2014, : 177 - 182
  • [22] A New Clustering Algorithm On Nominal Data Sets
    Wang, Bin
    [J]. INTERNATIONAL MULTICONFERENCE OF ENGINEERS AND COMPUTER SCIENTISTS (IMECS 2010), VOLS I-III, 2010, : 605 - 610
  • [23] Batch Clustering Algorithm for Big Data Sets
    Alguliyev, Rasim
    Aliguliyev, Ramiz
    Bagirov, Adil
    Karimov, Rafael
    [J]. 2016 IEEE 10TH INTERNATIONAL CONFERENCE ON APPLICATION OF INFORMATION AND COMMUNICATION TECHNOLOGIES (AICT), 2016, : 79 - 82
  • [24] Bayesian nonparametric clustering for large data sets
    Zuanetti, Daiane Aparecida
    Mueller, Peter
    Zhu, Yitan
    Yang, Shengjie
    Ji, Yuan
    [J]. STATISTICS AND COMPUTING, 2019, 29 (02) : 203 - 215
  • [25] Clustering Analysis for Large Scale Data Sets
    Singh, Sachin
    Mishra, Ashish
    [J]. 2015 INTERNATIONAL CONFERENCE ON COMPUTING, COMMUNICATION & AUTOMATION (ICCCA), 2015, : 1 - 4
  • [26] CLUSTERING OF LARGE DATA SETS - ZUPAN,J
    EVERITT, BS
    [J]. STATISTICIAN, 1983, 32 (03) : 355 - 355
  • [27] Bayesian nonparametric clustering for large data sets
    Daiane Aparecida Zuanetti
    Peter Müller
    Yitan Zhu
    Shengjie Yang
    Yuan Ji
    [J]. Statistics and Computing, 2019, 29 : 203 - 215
  • [28] Clustering Very Large Dissimilarity Data Sets
    Hammer, Barbara
    Hasenfuss, Alexander
    [J]. ARTIFICIAL NEURAL NETWORKS IN PATTERN RECOGNITION, PROCEEDINGS, 2010, 5998 : 259 - +
  • [29] Clustering Algorithms for Large Temporal Data Sets
    Scepi, Germana
    [J]. DATA ANALYSIS AND CLASSIFICATION, 2010, : 369 - 377
  • [30] CLUSTERING OF LARGE DATA SETS - ZUPAN,J
    WHITE, M
    [J]. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1983, 78 (383) : 733 - 734