Parallel DBSCAN Algorithm Using a Data Partitioning Strategy with Spark Implementation

Authors
Han, Dianwei [1]
Agrawal, Ankit [1]
Liao, Wei-keng [1]
Choudhary, Alok [1]
Affiliations
[1] Northwestern Univ, EECS Dept, Evanston, IL 60208 USA
Keywords
DBSCAN; clustering; big data; Spark framework; scalable; MR-DBSCAN
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
DBSCAN is a well-known density-based clustering algorithm that can identify arbitrarily shaped clusters and eliminate noise. However, existing parallel implementations based on MPI lack fault tolerance and offer no guarantee that their workload is balanced. Although some Hadoop-based approaches have been proposed, they do not scale well because their merge process is inefficient. We propose a scalable parallel DBSCAN algorithm that applies a data partitioning strategy and is implemented in Apache Spark. To reduce search time, the algorithm uses a kd-tree. To achieve better performance and scalability on top of the kd-tree, we adopt an effective partitioning technique aimed at producing balanced sub-domains that can be processed within Spark executors. Moreover, we introduce a new merging technique: by mapping the relationship between local points and their bordering neighbors, all the partial clusters generated in the executors are merged to form the final complete clusters. We have observed and verified through experiments that this merging approach is very effective in reducing the time taken by the merge phase and scales well as the number of processing cores and the number of generated partial clusters increase. We implemented the algorithm in Java and evaluated its scalability using different numbers of processing cores on real and synthetic datasets containing up to several hundred million high-dimensional points. We used three scales of datasets to evaluate our implementation. At small scale, we used 50k, 100k, and 500k data points, obtaining a speedup of up to 14.9 on 16 cores. At medium scale, we used 1.0, 1.5, and 1.9 million data points, obtaining a speedup of 109.2 on 128 cores. At large scale, we used 61.0, 91.5, and 115.9 million data points, obtaining a speedup of 8344.5 on 16384 cores.
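The abstract does not spell out how the merge step is carried out, so the following is only a minimal sketch of one common way to combine partial clusters, not the authors' method. It assumes that an earlier step has already produced a list of "links", each pairing two partial-cluster ids that were found to share points across a partition border, and it uses a union-find structure (a swapped-in, generic technique) to assign one global label per group of linked partial clusters. The class name `PartialClusterMerge`, the method names, and the link representation are all hypothetical.

```java
import java.util.Arrays;

public class PartialClusterMerge {
    // Find the representative of x, shortening the path as we go (path halving).
    static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    // Join the groups of a and b; the smaller root id becomes the representative.
    static void union(int[] parent, int a, int b) {
        int ra = find(parent, a);
        int rb = find(parent, b);
        if (ra != rb) {
            parent[Math.max(ra, rb)] = Math.min(ra, rb);
        }
    }

    /**
     * Hypothetical merge step: partial clusters are numbered 0..numPartial-1,
     * and each link {a, b} says that partial clusters a and b belong together
     * (assumed to have been detected from points near a partition border).
     * Returns, for each partial cluster, the id of its merged global cluster.
     */
    static int[] mergePartialClusters(int numPartial, int[][] links) {
        int[] parent = new int[numPartial];
        for (int i = 0; i < numPartial; i++) {
            parent[i] = i;
        }
        for (int[] link : links) {
            union(parent, link[0], link[1]);
        }
        int[] label = new int[numPartial];
        for (int i = 0; i < numPartial; i++) {
            label[i] = find(parent, i);
        }
        return label;
    }

    public static void main(String[] args) {
        // Partial clusters 0, 2, and 3 are linked; 1 and 4 stand alone.
        int[][] links = {{0, 2}, {2, 3}};
        System.out.println(Arrays.toString(mergePartialClusters(5, links)));
        // prints [0, 1, 0, 0, 4]
    }
}
```

In this sketch the merge cost depends on the number of partial clusters and links rather than on the number of data points, which is one plausible reason a merge based on border relationships could stay cheap as the dataset grows.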
Pages: 305-312
Number of pages: 8