Utilizing the Buckshot Algorithm for Efficient Big Data Clustering in the MapReduce Model

Cited by: 0

Authors:

Gerakidis, Sergios [1]

Mamalis, Basilis [2]

Affiliations:

[1] Hellen Open Univ, Patras, Greece

[2] Univ West Attica, Athens, Greece

Source:

PROCEEDINGS OF THE 23RD PAN-HELLENIC CONFERENCE OF INFORMATICS (PCI 2019) | 2019

Keywords:

Buckshot algorithm; Hierarchical agglomerative clustering; K-Means; Big Data; MapReduce; Spark

DOI:

10.1145/3368640.3368658

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods]

Discipline classification code:

081202

Abstract:

Clustering is an efficient data mining and machine learning method for gaining insight into which objects of a dataset can be grouped together. K-Means and Hierarchical Agglomerative Clustering (HAC) are two of the best-known and most commonly used clustering methods: the former for its low time cost, the latter for its accuracy. However, even K-Means can incur unpredictable time costs when used for document clustering over large-scale collections. In this paper, aiming at the efficient handling of big text data, we present a hybrid clustering approach based on a customized version of the Buckshot algorithm, which first applies a hierarchical clustering procedure to a sample of the input dataset and then uses the results as the initial centers for a K-Means-based assignment of the remaining documents, with very few iterations. We also give a highly efficient adaptation of the proposed Buckshot-based approach to the MapReduce model, which is then tested experimentally using Apache Hadoop on a real cluster environment. The experiments show that it achieves acceptable clustering quality together with significant improvements in execution time. Preliminary results from related experiments using the Spark framework are also presented.
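The two-phase idea described in the abstract can be illustrated with a minimal single-machine sketch. This is not the authors' customized version or their MapReduce adaptation. It assumes the sample size sqrt(k*n) of the classic Buckshot formulation, a naive centroid-linkage HAC, squared Euclidean distance on plain numeric tuples, and hypothetical function names (`hac_centroids`, `assign`, `buckshot`) chosen only for illustration.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def hac_centroids(sample, k):
    """Naive centroid-linkage HAC: merge the two closest clusters until k remain."""
    clusters = [[p] for p in sample]
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            ci = centroid(clusters[i])
            for j in range(i + 1, len(clusters)):
                d = sq_dist(ci, centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return [centroid(c) for c in clusters]


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points]


def buckshot(points, k, iterations=2, seed=0):
    """Phase 1: HAC on a random sample. Phase 2: a few K-Means passes over all points."""
    rng = random.Random(seed)
    n = len(points)
    # Assumption: sample size sqrt(k*n), as in the classic Buckshot description.
    sample_size = min(n, max(k, math.isqrt(k * n)))
    sample = rng.sample(points, sample_size)
    centers = hac_centroids(sample, k)
    labels = assign(points, centers)
    for _ in range(iterations):
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous center if a cluster becomes empty
                centers[c] = centroid(members)
        labels = assign(points, centers)
    return labels, centers
```

A call such as `labels, centers = buckshot(points, k=2)` returns one cluster index per input point plus the final centers. The quadratic HAC step only ever sees the sample, which is what keeps the overall cost low when `n` is large.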


Pages: 112-117

Page count: 6

Related records: 50 in total

[1] MapReduce Clustering for Big Data
Ghattas, Badih
Pinto, Antoine
Diao, Sambou
[J]. 2021 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2021, : 5116 - 5124
[2] Efficient MapReduce Kernel k-Means for Big Data Clustering
Tsapanos, Nikolaos
Tefas, Anastasios
Nikolaidis, Nikolaos
Pitas, Ioannis
[J]. 9TH HELLENIC CONFERENCE ON ARTIFICIAL INTELLIGENCE (SETN 2016), 2016
[3] Parallel Clustering Optimization Algorithm Based on MapReduce in Big Data Mining
Zhang, Huajie
Song, Lei
Zhang, Sen
[J]. IAENG International Journal of Applied Mathematics, 2023, 53 (01)
[4] An enhanced and efficient clustering algorithm for large data using MapReduce
Li, Hongbiao
Liu, Ruiying
Wang, Jingdong
Wu, Qilong
[J]. IAENG International Journal of Computer Science, 2019, 46 (01)
[5] A Big Graph Clustering Algorithm Based on MapReduce
Leng, Yonglin
Zhang, Qingchen
[J]. MODERN TECHNOLOGIES IN MATERIALS, MECHANICS AND INTELLIGENT SYSTEMS, 2014, 1049 : 1467 - +
[6] Clustering on Big Data Using Hadoop MapReduce
Akthar, Nadeem
Ahamad, Mohd Vasim
Khan, Shahbaz
[J]. 2015 INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND COMMUNICATION NETWORKS (CICN), 2015, : 789 - 795
[7] Efficient algorithm for big data clustering on single machine
Alguliyev, Rasim M.
Aliguliyev, Ramiz M.
Sukhostat, Lyudmila V.
[J]. CAAI TRANSACTIONS ON INTELLIGENCE TECHNOLOGY, 2020, 5 (01) : 9 - 14
[8] MapReduce based Method for Big Data Semantic Clustering
Yang, Jie
Li, Xiaoping
[J]. 2013 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS (SMC 2013), 2013, : 2814 - 2819
[9] Big data clustering with varied density based on MapReduce
Safanaz Heidari
Mahmood Alborzi
Reza Radfar
Mohammad Ali Afsharkazemi
Ali Rajabzadeh Ghatari
[J]. Journal of Big Data, 6
[10] Big data clustering with varied density based on MapReduce
Heidari, Safanaz
Alborzi, Mahmood
Radfar, Reza
Afsharkazemi, Mohammad Ali
Ghatari, Ali Rajabzadeh
[J]. JOURNAL OF BIG DATA, 2019, 6 (01)
