A Genetic Algorithm Approach for Clustering Large Data Sets

Cited by: 0

Authors:

Luchi, Diego [1]

Rodrigues, Alexandre [1]

Varejao, Flavio Miguel [1]

Santos, Willian [1]

Affiliations:

[1] Fed Univ State Espirito Santo, Vitoria, ES, Brazil

Source:

2016 IEEE 28TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2016) | 2016

Keywords:

DOI:

10.1109/ICTAI.2016.90

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In this paper we present a sampling approach for running the k-means algorithm on large data sets. We propose a genetic algorithm that guides the sampling by evaluating the fitness of each individual in the population with the k-means clustering algorithm. Although the goal is a partition with the lowest sum of squared errors (SSE), our algorithm searches for the sample with the highest SSE. Once a good sample is found, the remaining points of the data set are assigned to their nearest centroids, and the SSE of the final solution is then calculated. We apply the proposal to a set of public-domain data sets and compare the results against two other methods: k-means run on a uniform random sample of the data set, and k-means run on the complete data set. The results show that our algorithm offers a good trade-off between quality and computational cost, especially for large data sets and higher numbers of clusters.
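
The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the encoding, the genetic operators, or any parameter values, so everything below is an assumption made for illustration: individuals are lists of row indices, selection keeps the fitter half, crossover resamples from the union of two parents, mutation swaps in one random index, and the names `genetic_sample` and `cluster_large` are hypothetical. It is not the authors' implementation.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(p, centroids):
    """Index of the centroid closest to p."""
    return min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))

def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def kmeans(points, k, rng, iters=20):
    """Plain Lloyd iterations; an empty cluster keeps its old centroid."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids

def fitness(sample_idx, data, k, seed):
    """Fitness of a sample: SSE of k-means run on that sample alone."""
    sample = [data[i] for i in sample_idx]
    centroids = kmeans(sample, k, random.Random(seed))
    return sse(sample, centroids), centroids

def genetic_sample(data, k, sample_size, pop_size=8, generations=10, seed=0):
    """Search for the sample with the HIGHEST SSE (operators are assumed)."""
    rng = random.Random(seed)
    n = len(data)
    pop = [rng.sample(range(n), sample_size) for _ in range(pop_size)]
    score = lambda ind: fitness(ind, data, k, seed)[0]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)       # highest sample SSE first
        parents = pop[: pop_size // 2]          # assumed truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            pool = sorted(set(a) | set(b))      # assumed crossover: union pool
            child = rng.sample(pool, sample_size)
            if rng.random() < 0.2:              # assumed mutation rate
                new = rng.randrange(n)
                if new not in child:
                    child[rng.randrange(sample_size)] = new
            children.append(child)
        pop = parents + children
    return max(pop, key=score)

def cluster_large(data, k, sample_size, seed=0):
    """Cluster the best sample, then assign every point to its nearest centroid."""
    best = genetic_sample(data, k, sample_size, seed=seed)
    _, centroids = fitness(best, data, k, seed)
    labels = [nearest(p, centroids) for p in data]
    return centroids, labels, sse(data, centroids)

# Usage on synthetic data (two Gaussian blobs, made up for this example).
rng = random.Random(1)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
data += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(100)]
centroids, labels, total_sse = cluster_large(data, k=2, sample_size=20)
print(centroids, total_sse)
```

In this sketch k-means only ever runs on `sample_size` points, and the full data set is touched once, in the final nearest-centroid assignment. That is where a cost saving over k-means on the complete data set would come from, at the price of the repeated fitness evaluations inside the genetic search.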


Pages: 570-576

Page count: 7

