MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data

Cited by: 108
Authors
He, Yaobin [1, 3]
Tan, Haoyu [2]
Luo, Wuman [2]
Feng, Shengzhong [1]
Fan, Jianping [1]
Affiliations
[1] Chinese Acad Sci, Shenzhen Inst Adv Technol, Shenzhen 518055, Peoples R China
[2] Hong Kong Univ Sci & Technol, Guangzhou HKUST Fok Ying Tung Res Inst, Dept Comp Sci, Hong Kong 999077, Hong Kong, Peoples R China
[3] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
Keywords
data clustering; parallel algorithm; data mining; load balancing; CLUSTERING-ALGORITHM
DOI
10.1007/s11704-013-3158-3
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
DBSCAN (density-based spatial clustering of applications with noise) is an important spatial clustering technique that is widely adopted in numerous applications. Because datasets are now extremely large, parallel processing of complex data analysis such as DBSCAN has become indispensable. However, existing parallel DBSCAN algorithms have three major drawbacks. First, they fail to properly balance the load among parallel tasks, especially when data are heavily skewed. Second, the scalability of these algorithms is limited because not all the critical sub-procedures are parallelized. Third, most of them are not primarily designed for shared-nothing environments, which makes them less portable to emerging parallel processing paradigms. In this paper, we present MR-DBSCAN, a scalable DBSCAN algorithm using MapReduce. In our algorithm, all the critical sub-procedures are fully parallelized, so there is no performance bottleneck caused by sequential processing. Most importantly, we propose a novel data partitioning method based on computation cost estimation, with the objective of achieving desirable load balancing even for heavily skewed data. We conduct our evaluation using large real datasets with up to 1.2 billion points. The experimental results confirm the efficiency and scalability of MR-DBSCAN.
Pages: 83-99
Number of pages: 17
Related papers
50 in total
  • [1] MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data
    Yaobin He
    Haoyu Tan
    Wuman Luo
    Shengzhong Feng
    Jianping Fan
    [J]. Frontiers of Computer Science, 2014, 8 : 83 - 99
  • [2] MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data
    Yaobin HE
    Haoyu TAN
    Wuman LUO
    Shengzhong FENG
    Jianping FAN
    [J]. Frontiers of Computer Science, 2014, 8 (01) : 83 - 99
  • [3] A MapReduce-based improvement algorithm for DBSCAN
    Hu, Xiaojuan
    Liu, Lei
    Qiu, Ningjia
    Yang, Di
    Li, Meng
    [J]. JOURNAL OF ALGORITHMS & COMPUTATIONAL TECHNOLOGY, 2018, 12 (01) : 53 - 61
  • [4] MR-DBSCAN: An Efficient Parallel Density-based Clustering Algorithm using MapReduce
    He, Yaobin
    Tan, Haoyu
    Luo, Wuman
    Mao, Huajian
    Ma, Di
    Feng, Shengzhong
    Fan, Jianping
    [J]. 2011 IEEE 17TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS), 2011, : 473 - 480
  • [5] μDBSCAN: An Exact Scalable DBSCAN Algorithm for Big Data Exploiting Spatial Locality
    Sarma, Aditya
    Goyal, Poonam
    Kumari, Sonal
    Wani, Anand
    Challa, Jagat Sesh
    Islam, Saiyedul
    Goyal, Navneet
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2019, : 171 - 181
  • [6] Research on Parallel DBSCAN Algorithm Design Based on MapReduce
    Fu Yan Xiang
    Zhao Wei Zhong
    Ma Hui Fang
    [J]. ADVANCED MEASUREMENT AND TEST, PTS 1-3, 2011, 301-303 : 1133 - +
  • [7] Research of parallel DBSCAN clustering algorithm based on MapReduce
    [J]. Fu, X. (xffu@gdut.edu.cn), 1600, Science and Engineering Research Support Society (07):
  • [8] MR-BIRCH: A scalable MapReduce-based BIRCH clustering algorithm
    Li, Yufeng
    Jiang, HaiTian
    Lu, Jiyong
    Li, Xiaozhong
    Sun, Zhiwei
    Li, Min
    [J]. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2021, 40 (03) : 5295 - 5305
  • [9] Towards a New Approach for Empowering the MR-DBSCAN Clustering for Massive Data using Quadtree
    Ibrahim, Rami
    Shafiq, M. Omair
    [J]. IEEE 20TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS / IEEE 16TH INTERNATIONAL CONFERENCE ON SMART CITY / IEEE 4TH INTERNATIONAL CONFERENCE ON DATA SCIENCE AND SYSTEMS (HPCC/SMARTCITY/DSS), 2018, : 91 - 98
  • [10] Summarization using Mapreduce Framework based Big Data and Hybrid Algorithm (HMM and DBSCAN)
    Belerao, Krushnadeo Tanaji
    Chaudhari, S. B.
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON POWER, CONTROL, SIGNALS AND INSTRUMENTATION ENGINEERING (ICPCSI), 2017, : 377 - 380