A Spark-based Apriori algorithm with reduced shuffle overhead

Cited by: 0
Authors
Shashi Raj
Dharavath Ramesh
Krishan Kumar Sethi
Affiliations
[1] Bakhtiyarpur College of Engineering,Department of Computer Science and Engineering
[2] Indian Institute of Technology (ISM),Department of Computer Science and Engineering
Keywords
Apache Spark; Apriori algorithm; Large-scale datasets; Shuffle overhead
DOI
Not available
Abstract
Mining frequent itemsets is considered a core activity in finding association rules from transactional datasets. Among the well-known approaches to finding frequent itemsets, the Apriori algorithm is the earliest proposed. Many attempts have been made to adapt the Apriori algorithm to large-scale datasets. However, the bottlenecks associated with Apriori, such as repeated scans of the input dataset and generation of all candidate itemsets before counting their support values, reduce its effectiveness on large datasets. When the data size is large, even distributed and parallel implementations of Apriori using the MapReduce framework do not perform well. This is due to the iterative nature of the algorithm, which incurs high disk overhead: in each iteration, the input dataset, which resides on disk, is scanned again, causing high disk I/O. Apache Spark implementations of Apriori show better performance due to Spark's in-memory processing capabilities. Spark makes iterative scanning of a dataset faster by keeping it in a memory abstraction called a resilient distributed dataset (RDD). An RDD keeps datasets in the form of key-value pairs spread across the cluster nodes. RDD operations require these key-value pairs to be redistributed among cluster nodes in the course of processing. This redistribution, or shuffle operation, incurs communication and synchronization overhead. In this manuscript, we propose a novel approach, namely the Spark-based Apriori algorithm with reduced shuffle overhead (SARSO). It utilizes the benefits of Spark’s parallel and distributed computing environment and its in-memory processing capabilities. It improves efficiency further by reducing the shuffle overhead caused by RDD operations at each iteration.
In other words, it restricts the movement of key-value pairs across the cluster nodes by using a partitioning method, and hence reduces the communication and synchronization overhead incurred by the Spark shuffle operation. Extensive experiments have been conducted to measure the performance of SARSO on benchmark datasets and to compare it with an existing algorithm. Experimental results show that SARSO performs better in terms of running time and scalability.
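The abstract does not spell out SARSO's partitioning method, so the following is only a minimal plain-Python sketch of the general idea it describes: level-wise Apriori in which each partition counts candidate supports locally, so that fewer key-value pairs need to cross partition boundaries. It does not use Spark. The function names, the sample transactions, and the two "records moved" counters are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def generate_candidates(frequent, k):
    """Build k-itemsets whose (k-1)-subsets are all frequent (Apriori property).

    Brute-force enumeration over the surviving items; fine for a sketch,
    not how a production implementation would join itemsets.
    """
    items = sorted({item for itemset in frequent for item in itemset})
    candidates = set()
    for combo in combinations(items, k):
        if all(frozenset(sub) in frequent for sub in combinations(combo, k - 1)):
            candidates.add(frozenset(combo))
    return candidates


def count_partition(partition, candidates):
    """Count candidate supports using only the transactions of one partition."""
    counts = Counter()
    for transaction in partition:
        for candidate in candidates:
            if candidate <= transaction:
                counts[candidate] += 1
    return counts


def apriori(partitions, min_support):
    """Return (frequent itemsets with support, pairs_unaggregated, pairs_aggregated).

    pairs_unaggregated: pairs moved if every (itemset, 1) match were sent as is.
    pairs_aggregated:   pairs moved if each partition sums its own counts first.
    Both counters are illustrative bookkeeping, not Spark metrics.
    """
    candidates = {frozenset([i]) for p in partitions for t in p for i in t}
    result = {}
    pairs_unaggregated = pairs_aggregated = 0
    k = 1
    while candidates:
        partials = [count_partition(p, candidates) for p in partitions]
        pairs_unaggregated += sum(sum(c.values()) for c in partials)
        pairs_aggregated += sum(len(c) for c in partials)
        total = Counter()
        for partial in partials:
            total.update(partial)  # merge per-partition counts
        frequent = {s: n for s, n in total.items() if n >= min_support}
        result.update(frequent)
        k += 1
        candidates = generate_candidates(set(frequent), k)
    return result, pairs_unaggregated, pairs_aggregated


# Hypothetical toy dataset split into two partitions.
partitions = [
    [frozenset({"bread", "milk"}),
     frozenset({"bread", "diaper", "beer", "eggs"})],
    [frozenset({"milk", "diaper", "beer", "cola"}),
     frozenset({"bread", "milk", "diaper", "beer"}),
     frozenset({"bread", "milk", "diaper", "cola"})],
]
result, pairs_unaggregated, pairs_aggregated = apriori(partitions, min_support=3)
print(len(result), pairs_unaggregated, pairs_aggregated)
```

In this toy model the aggregated counter can never exceed the unaggregated one, because each partition contributes at most one pair per distinct itemset. How much a real Spark job saves depends on the operations and partitioner used, which the abstract does not specify.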
Pages: 133-151
Number of pages: 18