A Spark-based Apriori algorithm with reduced shuffle overhead

Authors
Shashi Raj
Dharavath Ramesh
Krishan Kumar Sethi
Affiliations
[1] Bakhtiyarpur College of Engineering, Department of Computer Science and Engineering
[2] Indian Institute of Technology (ISM), Department of Computer Science and Engineering
Keywords
Apache Spark; Apriori algorithm; Large-scale datasets; Shuffle overhead
DOI
Not available
Abstract
Mining frequent itemsets is a core step in finding association rules in transactional datasets. Among the well-known approaches to finding frequent itemsets, the Apriori algorithm was the earliest proposed. Many attempts have been made to adapt the Apriori algorithm to large-scale datasets. However, bottlenecks associated with Apriori, such as repeated scans of the input dataset and the generation of all candidate itemsets before their support values are counted, reduce its effectiveness on large datasets. When the data size is large, even distributed and parallel implementations of Apriori using the MapReduce framework do not perform well. This is due to the iterative nature of the algorithm, which incurs high disk overhead: in each iteration, the input dataset, which resides on disk, is scanned, causing high disk I/O. Apache Spark implementations of Apriori show better performance due to Spark's in-memory processing capabilities. Spark makes iterative scanning of a dataset faster by keeping it in a memory abstraction called a resilient distributed dataset (RDD). An RDD keeps datasets in the form of key-value pairs spread across the cluster nodes. RDD operations require these key-value pairs to be redistributed among cluster nodes in the course of processing. This redistribution, or shuffle operation, incurs communication and synchronization overhead. In this manuscript, we propose a novel approach, namely the Spark-based Apriori algorithm with reduced shuffle overhead (SARSO). It utilizes the benefits of Spark's parallel and distributed computing environment and its in-memory processing capabilities. It improves efficiency further by reducing the shuffle overhead caused by RDD operations at each iteration. In other words, it restricts the movement of key-value pairs across the cluster nodes by using a partitioning method and hence reduces the communication and synchronization overhead incurred by the Spark shuffle operation. Extensive experiments have been conducted to measure the performance of SARSO on benchmark datasets and to compare it with an existing algorithm. Experimental results show that SARSO has better performance in terms of running time and scalability.
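The abstract does not spell out SARSO's partitioning method, so the following is only an illustrative sketch of the general idea it describes, not the authors' algorithm. It runs level-wise Apriori in plain Python over two lists that stand in for cluster partitions (no Spark involved) and compares how many key-value records would have to cross partition boundaries with and without aggregating counts inside each partition first. The function names, the toy transactions, and the `min_support` value are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def candidates(prev_frequent, k):
    """Build k-itemset candidates from the frequent (k-1)-itemsets, keeping
    only those whose every (k-1)-subset is frequent (the Apriori property)."""
    items = sorted({i for s in prev_frequent for i in s})
    return [
        frozenset(c)
        for c in combinations(items, k)
        if all(frozenset(sub) in prev_frequent for sub in combinations(c, k - 1))
    ]


def count_naive(partitions, cands):
    """Emit one (itemset, 1) record per match; every record is 'shuffled'."""
    totals, shuffled = Counter(), 0
    for part in partitions:
        for transaction in part:
            for c in cands:
                if c <= transaction:
                    totals[c] += 1
                    shuffled += 1
    return totals, shuffled


def count_local(partitions, cands):
    """Aggregate inside each partition first, so only one record per
    (partition, itemset) pair is 'shuffled'."""
    totals, shuffled = Counter(), 0
    for part in partitions:
        local = Counter(c for t in part for c in cands if c <= t)
        shuffled += len(local)
        totals.update(local)
    return totals, shuffled


def apriori(partitions, min_support, counter):
    """Level-wise Apriori; returns frequent itemsets and total shuffled records."""
    items = {i for part in partitions for t in part for i in t}
    cands = [frozenset([i]) for i in sorted(items)]
    frequent, moved, k = {}, 0, 1
    while cands:
        totals, shuffled = counter(partitions, cands)
        moved += shuffled
        level = {c: n for c, n in totals.items() if n >= min_support}
        frequent.update(level)
        k += 1
        cands = candidates(set(level), k)
    return frequent, moved


# Hypothetical toy data: two "partitions" of three transactions each.
partitions = [
    [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}],
    [{"a", "b", "c"}, {"b", "c"}, {"a", "b"}],
]

freq_naive, moved_naive = apriori(partitions, min_support=3, counter=count_naive)
freq_local, moved_local = apriori(partitions, min_support=3, counter=count_local)

print("same frequent itemsets:", freq_naive == freq_local)
print("records shuffled (naive):", moved_naive)
print("records shuffled (local aggregation):", moved_local)
```

Both variants find the same frequent itemsets; the difference is only in how many records leave their partition at each iteration, which is the cost the abstract attributes to the Spark shuffle operation.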
Pages: 133-151
Page count: 18