Utilizing the Buckshot Algorithm for Efficient Big Data Clustering in the MapReduce Model

Cited by: 0
Authors
Gerakidis, Sergios [1]
Mamalis, Basilis [2]
Affiliations
[1] Hellen Open Univ, Patras, Greece
[2] Univ West Attica, Athens, Greece
Keywords
Buckshot algorithm; Hierarchical agglomerative clustering; K-Means; Big Data; MapReduce; Spark
DOI
10.1145/3368640.3368658
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Clustering is an efficient data mining and machine learning method for gaining insight into which objects of a dataset can be grouped together. The K-Means algorithm and the Hierarchical Agglomerative Clustering (HAC) algorithm are two of the best-known and most commonly used clustering methods; the former for its low time cost and the latter for its accuracy. However, even K-Means can lead to unpredictable time costs when used for document clustering over large-scale collections. In this paper, aiming at the efficient handling of big text data, we present a hybrid clustering approach based on a customized version of the Buckshot algorithm, which first applies a hierarchical clustering procedure to a sample of the input dataset and then uses the results as the initial centers for a K-Means based assignment of the remaining documents, with very few iterations. We also give a highly efficient adaptation of the proposed Buckshot-based approach to the MapReduce model, which is then experimentally tested using Apache Hadoop on a real cluster environment. The experiments show that it achieves acceptable clustering quality as well as significant execution time improvements. Preliminary results from related experiments using the Spark framework are also presented.
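The two phases named in the abstract (hierarchical clustering of a sample, then a K-Means style assignment seeded with the resulting centers) can be illustrated with a minimal single-machine sketch. This is not the authors' customized version or their MapReduce adaptation. It makes several assumptions that the abstract does not state: a sample size of about sqrt(k * n), which is the conventional Buckshot choice, plain numeric vectors in place of document vectors, squared Euclidean distance, and a naive centroid-linkage merge rule. All function names are hypothetical.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hac_centers(sample, k):
    """Naive agglomerative clustering (assumed centroid linkage): merge the
    two clusters with the closest centroids until k clusters remain."""
    clusters = [[p] for p in sample]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sq_dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [centroid(c) for c in clusters]


def buckshot(points, k, iterations=2, seed=0):
    """Phase 1: HAC on a random sample. Phase 2: a few K-Means passes."""
    rng = random.Random(seed)
    n = len(points)
    # Assumption: the conventional Buckshot sample size, sqrt(k * n).
    sample_size = min(n, max(k, math.ceil(math.sqrt(k * n))))
    centers = hac_centers(rng.sample(points, sample_size), k)
    labels = [0] * n
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest current center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: recompute each non-empty cluster's center.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = centroid(members)
    return labels, centers


if __name__ == "__main__":
    data = [[0, 0], [0, 1], [10, 10], [10, 11]]
    print(buckshot(data, k=2))
```

The quadratic cost of the hierarchical phase is paid only on the sample, which is why the sample size matters. In a MapReduce setting the assignment step would plausibly correspond to a map phase and the center update to a reduce phase, but the abstract does not describe how the authors split the work.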
Pages: 112-117
Page count: 6