Parallel DBSCAN Algorithm Using a Data Partitioning Strategy with Spark Implementation

Authors
Han, Dianwei [1]
Agrawal, Ankit [1]
Liao, Wei-keng [1]
Choudhary, Alok [1]
Affiliations
[1] Northwestern Univ, EECS Dept, Evanston, IL 60208 USA
Keywords
DBSCAN; clustering; big data; Spark framework; scalable; MR-DBSCAN;
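DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract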
DBSCAN is a well-known density-based clustering algorithm that can identify arbitrarily shaped clusters and eliminate noise. However, existing parallel implementations based on MPI lack fault tolerance and offer no guarantee of a balanced workload. Some Hadoop-based approaches have been proposed, but they do not scale well because their merge process is inefficient. We propose a scalable parallel DBSCAN algorithm that applies a data partitioning strategy and is implemented in Apache Spark. To reduce search time, the algorithm uses a kd-tree. To achieve better performance and scalability on top of the kd-tree, we adopt an effective partitioning technique aimed at producing balanced sub-domains that can be processed within Spark executors. We also introduce a new merging technique: by mapping the relationship between local points and their bordering neighbors, all partial clusters generated in the executors are merged to form the final complete clusters. Our experiments show that this merging approach is very effective in reducing the time taken by the merge phase and scales well as the number of processing cores and generated partial clusters increases. We implemented the algorithm in Java and evaluated its scalability with different numbers of processing cores on real and synthetic datasets containing up to several hundred million high-dimensional points. We used datasets at three scales. At small scale (50k, 100k, and 500k data points), we obtain a speedup of up to 14.9 on 16 cores. At medium scale (1.0m, 1.5m, and 1.9m data points), we obtain a speedup of 109.2 on 128 cores. At large scale (61.0m, 91.5m, and 115.9m data points), we obtain a speedup of 8344.5 on 16384 cores.
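The merge step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that points near partition borders are replicated into neighboring partitions, so the same point can receive labels from more than one partial cluster, and it swaps in a plain union-find structure to join partial clusters that share a core point. The names `PartialClusterMerge`, `Label`, and `merge` are hypothetical, and the Spark side (collecting border labels from the executors) is omitted.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: merge partial clusters with union-find (not the paper's code).
final class PartialClusterMerge {

    /** One label produced inside one partition (assumed input format). */
    static final class Label {
        final long pointId;       // global id of the point
        final int partialCluster; // id of the partial cluster, 0..n-1
        final boolean core;       // true if the point is a core point in that partition

        Label(long pointId, int partialCluster, boolean core) {
            this.pointId = pointId;
            this.partialCluster = partialCluster;
            this.core = core;
        }
    }

    private final int[] parent;

    private PartialClusterMerge(int n) {
        parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
    }

    private int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]]; // path halving
            x = parent[x];
        }
        return x;
    }

    private void union(int a, int b) {
        int ra = find(a), rb = find(b);
        // Keep the smaller id as root, so every merged cluster is named by its smallest member.
        if (ra != rb) parent[Math.max(ra, rb)] = Math.min(ra, rb);
    }

    /** Returns, for each partial cluster id, the id of the merged cluster it belongs to. */
    static int[] merge(int numPartialClusters, List<Label> labels) {
        PartialClusterMerge uf = new PartialClusterMerge(numPartialClusters);

        // Pass 1: find points that are core in at least one partition.
        Set<Long> corePoints = new HashSet<>();
        for (Label l : labels) if (l.core) corePoints.add(l.pointId);

        // Pass 2: a shared core point joins every partial cluster that contains it.
        // Shared border (non-core) points never bridge two clusters.
        Map<Long, Integer> firstSeen = new HashMap<>();
        for (Label l : labels) {
            if (!corePoints.contains(l.pointId)) continue;
            Integer prev = firstSeen.putIfAbsent(l.pointId, l.partialCluster);
            if (prev != null) uf.union(prev, l.partialCluster);
        }

        int[] global = new int[numPartialClusters];
        for (int c = 0; c < numPartialClusters; c++) global[c] = uf.find(c);
        return global;
    }

    public static void main(String[] args) {
        // Partial clusters 0 and 1 come from one partition, 2 from a neighboring one.
        List<Label> labels = Arrays.asList(
                new Label(7L, 0, true),   // point 7 is core in cluster 0 ...
                new Label(7L, 2, false),  // ... and also appears in cluster 2
                new Label(9L, 1, false),  // point 9 is only a border point
                new Label(9L, 2, false));
        System.out.println(Arrays.toString(merge(3, labels))); // prints [0, 1, 0]
    }
}
```

In this toy input, clusters 0 and 2 are joined through the shared core point 7, while cluster 1 stays separate because point 9 is a border point in both places. Only the replicated border-region labels need to reach the merge step, which is one plausible reason a mapping-based merge can remain cheap as the number of partial clusters grows.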
Pages: 305-312
Number of pages: 8
Related papers
  • [31] First Filling Strategy-Based Partitioning Method to Balance Data in Spark
    He, Yu-Lin
    Wu, Dong-Tong
    Philippe, Fournier-Viger
    Huang, Zhe-Xue
    Tien Tzu Hsueh Pao/Acta Electronica Sinica, 2024, 52 (10): : 3322 - 3335
  • [32] VLSI Partitioning using Parallel Kernighan Lin Algorithm
    Rajan, Archana K.
    Bhaiya, Deepika
    2017 INTERNATIONAL CONFERENCE ON COMMUNICATION AND SIGNAL PROCESSING (ICCSP), 2017, : 1897 - 1901
  • [33] Spark Accelerated Implementation of Parallel Attribute Reduction from Incomplete Data
    Cao, Qian
    Luo, Chuan
    Li, Tianrui
    Chen, Hongmei
    ROUGH SETS (IJCRS 2021), 2021, 12872 : 203 - 217
  • [34] K-DBSCAN: An improved DBSCAN algorithm for big data
    Gholizadeh, Nahid
    Saadatfar, Hamid
    Hanafi, Nooshin
    JOURNAL OF SUPERCOMPUTING, 2021, 77 (06): : 6214 - 6235
  • [35] Dominant Partitioning of Discontinuities of Rock Masses Based on DBSCAN Algorithm
    Ruan, Yunkai
    Liu, Weicheng
    Wang, Tanhua
    Chen, Jinzi
    Zhou, Xin
    Sun, Yunqiang
    APPLIED SCIENCES-BASEL, 2023, 13 (15):
  • [36] Parallel Implementation of FP Growth Algorithm on XML Data Using Multiple GPU
    Rathi, Sheetal
    Dhote, C. A.
    INFORMATION SYSTEMS DESIGN AND INTELLIGENT APPLICATIONS, VOL 1, 2015, 339 : 581 - 589
  • [37] Data Preprocessing with GPU for DBSCAN Algorithm
    Cal, Piotr
    Wozniak, Michal
    PROCEEDINGS OF THE 8TH INTERNATIONAL CONFERENCE ON COMPUTER RECOGNITION SYSTEMS CORES 2013, 2013, 226 : 793 - 801
  • [38] Incremental Data Partitioning of RDF Data in SPARK
    Agathangelos, Giannis
    Troullinou, Georgia
    Kondylakis, Haridimos
    Stefanidis, Kostas
    Plexousakis, Dimitris
    SEMANTIC WEB: ESWC 2018 SATELLITE EVENTS, 2018, 11155 : 50 - 54
  • [39] DBSCAN algorithm for AIS data reconstruction
    Mieczynska, Marta
    Czarnowski, Ireneusz
    KNOWLEDGE-BASED AND INTELLIGENT INFORMATION & ENGINEERING SYSTEMS (KSE 2021), 2021, 192 : 2512 - 2521
  • [40] Parallel implementation of evolution strategy optimization algorithm on multicore processors
    Ivanov, Petar
    Brandisky, Kostadin
    COMPEL-THE INTERNATIONAL JOURNAL FOR COMPUTATION AND MATHEMATICS IN ELECTRICAL AND ELECTRONIC ENGINEERING, 2009, 28 (05) : 1129 - 1140