Parallel DBSCAN Algorithm Using a Data Partitioning Strategy with Spark Implementation

Authors
Han, Dianwei [1]
Agrawal, Ankit [1]
Liao, Wei-keng [1]
Choudhary, Alok [1]
Affiliations
[1] Northwestern Univ, EECS Dept, Evanston, IL 60208 USA
Keywords
DBSCAN; clustering; big data; Spark framework; scalable; MR-DBSCAN
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
DBSCAN is a well-known density-based clustering algorithm that can identify arbitrarily shaped clusters and eliminate noise. However, existing parallel implementations based on MPI lack fault tolerance and do not guarantee a balanced workload. Although some Hadoop-based approaches have been proposed, they do not scale well because their merge process is inefficient. We propose a scalable parallel DBSCAN algorithm that applies a partitioning strategy and is implemented in Apache Spark. To reduce search time, the algorithm uses a k-d tree. To achieve better performance and scalability on top of the k-d tree, we adopt an effective partitioning technique aimed at producing balanced sub-domains that can be computed within Spark executors. Moreover, we introduce a new merging technique: by mapping the relationship between the local points and their bordering neighbors, all the partial clusters generated in the executors are merged to form the final complete clusters. We have observed and verified through experiments that this merging approach is very effective in reducing the time taken by the merge phase, and that it scales well as the number of processing cores and the number of generated partial clusters increase. We implemented the algorithm in Java and evaluated its scalability using different numbers of processing cores on real and synthetic datasets containing up to several hundred million high-dimensional points. We used three scales of datasets to evaluate our implementation. At small scale, we use 50k, 100k, and 500k data points and obtain a speedup of up to 14.9 on 16 cores. At medium scale, we use 1.0 million, 1.5 million, and 1.9 million data points and obtain a speedup of 109.2 on 128 cores. At large scale, we use 61.0 million, 91.5 million, and 115.9 million data points and obtain a speedup of 8344.5 on 16384 cores.
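The merge step described in the abstract can be illustrated with a small sketch. The abstract does not say how the mapping between local points and bordering neighbors is stored or resolved, so everything below is an assumption made for illustration: the class name `PartialClusterMerger`, the string ids such as `p0:0` (partition 0, local cluster 0), and the use of a union-find structure are all hypothetical and are not taken from the paper. The sketch further assumes that the border map contains only points that legitimately link clusters, for example points that are core points in at least one partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: merging partial clusters with union-find (not the paper's code). */
class PartialClusterMerger {
    // Maps each partial-cluster id to its parent in the union-find forest.
    private final Map<String, String> parent = new HashMap<>();

    /** Returns the representative of a partial cluster, compressing the path on the way. */
    String find(String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        String cur = id;
        while (!cur.equals(root)) {
            String next = parent.get(cur);
            parent.put(cur, root);
            cur = next;
        }
        return root;
    }

    /** Records that two partial clusters belong to the same final cluster. */
    void union(String a, String b) {
        String ra = find(a);
        String rb = find(b);
        if (!ra.equals(rb)) {
            parent.put(ra, rb);
        }
    }

    /**
     * borderLabels: global point id -> partial-cluster ids given to that point by
     * every partition that saw it (assumed to be pre-filtered to valid links).
     * Returns: partial-cluster id -> final cluster number.
     */
    static Map<String, Integer> merge(Map<Long, List<String>> borderLabels,
                                      Collection<String> allPartialClusters) {
        PartialClusterMerger uf = new PartialClusterMerger();
        for (List<String> labels : borderLabels.values()) {
            for (int i = 1; i < labels.size(); i++) {
                uf.union(labels.get(0), labels.get(i));
            }
        }
        // Number the final clusters in sorted order of partial-cluster id.
        Map<String, Integer> rootToFinal = new HashMap<>();
        Map<String, Integer> result = new TreeMap<>();
        for (String c : new TreeSet<>(allPartialClusters)) {
            String root = uf.find(c);
            Integer finalId = rootToFinal.get(root);
            if (finalId == null) {
                finalId = rootToFinal.size();
                rootToFinal.put(root, finalId);
            }
            result.put(c, finalId);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Long, List<String>> border = new HashMap<>();
        border.put(7L, Arrays.asList("p0:0", "p1:0"));   // point 7 links p0:0 and p1:0
        border.put(42L, Arrays.asList("p1:1", "p2:0"));  // point 42 links p1:1 and p2:0
        List<String> all = new ArrayList<>(Arrays.asList("p0:0", "p1:0", "p1:1", "p2:0"));
        System.out.println(merge(border, all));
        // prints {p0:0=0, p1:0=0, p1:1=1, p2:0=1}
    }
}
```

In a Spark setting, the border map would presumably be collected from the executors before a step like this runs. Union-find is used here only because it keeps the merge cost close to linear in the number of links, which matches the abstract's emphasis on a cheap merge phase; the paper may well use a different mechanism.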
Pages: 305-312
Page count: 8