Big Data Clustering with Kernel k-Means: Resources, Time and Performance

Cited by: 2

Authors:

Tsapanos, Nikolaos [1]

Tefas, Anastasios [1]

Nikolaidis, Nikolaos [1]

Pitas, Ioannis [1]

Affiliations:

[1] Aristotle University of Thessaloniki, Department of Informatics, University Campus, Box 54 124, Thessaloniki, Greece

Source:

International Journal on Artificial Intelligence Tools | 2018 / Vol. 27 / No. 04

Keywords:

Big data; kernel k-means; data clustering; approximate kernel k-means; Apache Spark; distributed computation; computation; histograms

DOI:

10.1142/S0218213018600060

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Data clustering is an unsupervised learning task with applications in many scientific fields. The goal is to find subgroups of closely related data samples (clusters) in a set of unlabeled data. A classic clustering algorithm is k-Means. It is very popular, but it cannot handle cases in which the clusters are not linearly separable. Kernel k-Means is a state-of-the-art clustering algorithm that employs the kernel trick to perform clustering in a higher-dimensional space, thus overcoming the limitations of classic k-Means with respect to non-linearly separable input data. Big Data research, a field that has established itself in the last few years, involves performing tasks on extremely large amounts of data. To meet its challenges, several adaptations of Kernel k-Means have been proposed, each with different requirements in processing power and running time and each incurring different trade-offs in performance. In this paper, we present several issues and techniques involving the use of Kernel k-Means for Big Data clustering and examine how each combination of components in a clustering framework fares in terms of resources, time and performance. We use experimental results to evaluate several combinations and provide a recommendation on how to approach a Big Data clustering problem.
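The kernel trick mentioned in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation and it does not cover the approximate or distributed (Apache Spark) variants the paper evaluates. The function names `rbf_kernel` and `kernel_kmeans`, the choice of an RBF kernel, and the parameters `gamma`, `n_iter`, `seed` and `init_labels` are all assumptions made for illustration. The sketch relies only on the standard identity that the squared feature-space distance from a sample to a cluster mean can be written entirely in terms of kernel matrix entries.

```python
import numpy as np


def rbf_kernel(X, gamma=1.0):
    """Full RBF kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(d2, 0.0))


def kernel_kmeans(K, k, n_iter=50, seed=0, init_labels=None):
    """Plain kernel k-means on a precomputed n x n kernel matrix K.

    Illustrative sketch only: it stores the whole kernel matrix, so memory
    grows quadratically with the number of samples.
    """
    n = K.shape[0]
    if init_labels is None:
        # Assumption: random initial assignment; other schemes are possible.
        labels = np.random.default_rng(seed).integers(0, k, size=n)
    else:
        labels = np.asarray(init_labels)
    diag = np.diag(K)
    for _ in range(n_iter):
        dist = np.full((n, k), np.inf)  # empty clusters stay at +inf
        for c in range(k):
            mask = labels == c
            m = mask.sum()
            if m == 0:
                continue
            # ||phi(x_i) - mu_c||^2
            #   = K_ii - (2/m) * sum_{j in c} K_ij + (1/m^2) * sum_{j,l in c} K_jl
            dist[:, c] = (
                diag
                - 2.0 * K[:, mask].sum(axis=1) / m
                + K[np.ix_(mask, mask)].sum() / m ** 2
            )
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
    return labels
```

A call such as `kernel_kmeans(rbf_kernel(X, gamma=1.0), k=2)` returns one cluster index per row of `X`. Because the full kernel matrix needs memory quadratic in the number of samples, this direct form does not scale to very large datasets, which is the motivation for the adaptations the abstract refers to.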


Pages: 18
