A Genetic Algorithm Approach for Clustering Large Data Sets

Cited by: 0
Authors
Luchi, Diego [1]
Rodrigues, Alexandre [1]
Varejao, Flavio Miguel [1]
Santos, Willian [1]
Affiliations
[1] Fed Univ State Espirito Santo, Vitoria, ES, Brazil
Keywords
DOI
10.1109/ICTAI.2016.90
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper we present a sampling approach for running the k-means algorithm on large data sets. We propose a genetic algorithm that guides sampling by evaluating the fitness of each individual in the population with the k-means clustering algorithm. Although the goal is a partition with the lowest sum of squared errors (SSE), the algorithm searches for the sample with the highest SSE. Once a good sample is found, the remaining points of the data set are assigned to the nearest centroid, and the SSE of the final solution is then calculated. The proposal is applied to a set of public-domain data sets, and the results are compared against two other methods: k-means run on a uniform random sample of the data set, and k-means run on the complete data set. The results show that the algorithm offers a good trade-off between quality and computational cost, especially for large data sets and larger numbers of clusters.
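The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the genetic operators, population size, or k-means variant, so everything below is an assumption made for illustration: individuals are index sets of fixed size, selection keeps the top half by SSE, crossover draws a child from the union of two parents, and mutation swaps in one outside point. The names `kmeans`, `ga_sample`, and `cluster_with_sample` are hypothetical and not taken from the paper.

```python
import numpy as np

def sq_dists(X, centroids):
    """Squared Euclidean distance from every row of X to every centroid."""
    return ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

def kmeans(X, k, rng, iters=20):
    """Plain Lloyd's k-means; returns (centroids, SSE measured on X)."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = sq_dists(X, centroids).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave an empty cluster's centroid as is
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, sq_dists(X, centroids).min(axis=1).sum()

def ga_sample(X, k, sample_size, pop_size=8, generations=5, seed=0):
    """Search for the index sample whose k-means SSE is highest.

    Assumed operators (not from the paper): truncation selection,
    union-based crossover, single-index mutation.
    Requires pop_size >= 4 and k <= sample_size < len(X).
    """
    rng = np.random.default_rng(seed)
    n = len(X)

    def fitness(idx):
        return kmeans(X[idx], k, rng)[1]

    pop = [rng.choice(n, size=sample_size, replace=False)
           for _ in range(pop_size)]
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        best_first = np.argsort(scores)[::-1]  # highest SSE first
        parents = [pop[i] for i in best_first[:pop_size // 2]]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.choice(len(parents), size=2, replace=False)
            pool = np.union1d(parents[a], parents[b])
            child = rng.choice(pool, size=sample_size, replace=False)
            outside = np.setdiff1d(np.arange(n), child)
            child[rng.integers(sample_size)] = rng.choice(outside)
            children.append(child)
        pop = parents + children
    scores = [fitness(ind) for ind in pop]
    return pop[int(np.argmax(scores))]

def cluster_with_sample(X, k, sample_size, seed=0):
    """Cluster the chosen sample, then assign every point to its nearest centroid."""
    idx = ga_sample(X, k, sample_size, seed=seed)
    centroids, _ = kmeans(X[idx], k, np.random.default_rng(seed))
    d = sq_dists(X, centroids)
    return d.argmin(axis=1), d.min(axis=1).sum()  # labels, SSE on the full data set
```

A call such as `labels, sse = cluster_with_sample(X, k=3, sample_size=30)` on a float array `X` of shape `(n, d)` returns one label per point and the SSE over the full data set. The expensive k-means iterations only ever touch `sample_size` points, which is where the computational saving over running k-means on the complete data set would come from.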
Pages: 570-576
Number of pages: 7
Related papers
50 in total
  • [41] Image-mapped data clustering: An efficient technique for clustering large data sets
    Al-Omari, Faruq
    Al-Fayoumi, Nabeel
    Al-Jarrah, Mohammad
    [J]. INTELLIGENT DATA ANALYSIS, 2008, 12 (06) : 573 - 586
  • [42] A GA-based clustering algorithm for large data sets with mixed numeric and categorical values
    Li, J
    Gao, XB
    Jiao, LC
    [J]. ICCIMA 2003: FIFTH INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND MULTIMEDIA APPLICATIONS, PROCEEDINGS, 2003, : 102 - 107
  • [43] A Valid Clustering Algorithm for High-dimensional Large Data Sets Based on Distributed Method
    Guo Xian e
    Yan Junmei
    [J]. PROCEEDINGS OF 2009 INTERNATIONAL WORKSHOP ON INFORMATION SECURITY AND APPLICATION, 2009, : 1 - 6
  • [44] A EM Probabilistic Clustering Algorithm for Large Scale Data Sets based on Partial Constraints Information
    Yan S.
    Shunlin S.
    Yuquan Z.
    [J]. Advances in Information Sciences and Service Sciences, 2011, 3 (10) : 20 - 29
  • [45] A GA-based clustering algorithm for large data sets with mixed numeric and categorical values
    Li, J
    Gao, XB
    Jiao, LC
    [J]. THIRD INTERNATIONAL SYMPOSIUM ON MULTISPECTRAL IMAGE PROCESSING AND PATTERN RECOGNITION, PTS 1 AND 2, 2003, 5286 : 171 - 174
  • [46] SCRAPT: an iterative algorithm for clustering large 16S rRNA gene data sets
    Luan, Tu
    Muralidharan, Harihara Subrahmaniam
    Alshehri, Marwan
    Mittra, Ipsa
    Pop, Mihai
    [J]. NUCLEIC ACIDS RESEARCH, 2023, 51 (08) : e46
  • [47] Performance of an ensemble clustering algorithm on biological data sets
    Pirim, Harun
    Gautam, Dilip
    Bhowmik, Tanmay
    Perkins, Andy D.
    Ekşioglu, Burak
    Alkan, Ahmet
    [J]. Mathematical and Computational Applications, 2011, 16 (01) : 87 - 96
  • [48] Introduce a New Algorithm for Data Clustering by Genetic Algorithm
    Vahidi, J.
    Mirpour, Saeed
    [J]. JOURNAL OF MATHEMATICS AND COMPUTER SCIENCE-JMCS, 2014, 10 (02) : 144 - 156
  • [49] Clustering Based Bagging Algorithm on Imbalanced Data Sets
    Sun, Xiao-Yan
    Zhang, Hua-Xiang
    Wang, Zhi-Chao
    [J]. INTEGRATED UNCERTAINTY IN KNOWLEDGE MODELLING AND DECISION MAKING, 2011, 7027 : 179 - 186
  • [50] An algorithm for adaptive clustering and visualisation of highdimensional data sets
    Schwenker, F
    Kestler, HA
    Palm, G
    [J]. COMPUTATIONAL INTELLIGENCE IN DATA MINING, 2000, (408) : 127 - 140