A Spark-based Apriori algorithm with reduced shuffle overhead

Cited by: 0
Authors
Shashi Raj
Dharavath Ramesh
Krishan Kumar Sethi
Affiliations
[1] Bakhtiyarpur College of Engineering, Department of Computer Science and Engineering
[2] Indian Institute of Technology (ISM), Department of Computer Science and Engineering
Keywords
Apache Spark; Apriori algorithm; Large-scale datasets; Shuffle overhead
Abstract
Mining frequent itemsets is a core step in finding association rules from transactional datasets. Among the well-known approaches to finding frequent itemsets, the Apriori algorithm is the earliest. Many attempts have been made to adapt Apriori to large-scale datasets, but its bottlenecks, such as repeated scans of the input dataset and the generation of all candidate itemsets before their support values are counted, reduce its effectiveness on large datasets. When the data size is large, even distributed and parallel implementations of Apriori on the MapReduce framework do not perform well, because the iterative nature of the algorithm incurs high disk overhead: in each iteration the input dataset, which resides on disk, must be scanned, causing heavy disk I/O. Apache Spark implementations of Apriori perform better thanks to in-memory processing, which speeds up iterative scanning by keeping the dataset in a memory abstraction called the resilient distributed dataset (RDD). An RDD stores a dataset as key-value pairs spread across the cluster nodes, and RDD operations require these pairs to be redistributed among the nodes during processing. This redistribution, or shuffle, incurs communication and synchronization overhead. In this manuscript, we propose a novel approach, the Spark-based Apriori algorithm with reduced shuffle overhead (SARSO). It exploits Spark's parallel and distributed computing environment and its in-memory processing capabilities, and it improves efficiency further by reducing the shuffle overhead caused by RDD operations in each iteration. In other words, it restricts the movement of key-value pairs across the cluster nodes by using a partitioning method and hence reduces the communication and synchronization overhead incurred by Spark's shuffle operation. Extensive experiments on benchmark datasets measure the performance of SARSO and compare it with an existing algorithm. The results show that SARSO performs better in terms of running time and scalability.
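The abstract's central mechanism is keeping key-value pairs from moving between cluster nodes during the iterative counting phases. The sketch below is not the paper's SARSO algorithm; it only illustrates, in plain Spark RDD code (Scala), how an explicit HashPartitioner combined with the map-side combining of reduceByKey keeps most support counting local to a partition during the frequent-1-itemset step. The input layout (one transaction per line, whitespace-separated integer item ids), the object name, and the command-line arguments are assumptions made for this example.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// A minimal, hypothetical sketch of shuffle-conscious frequent-1-itemset counting.
// It is not the paper's SARSO implementation.
object ShuffleReducedCounting {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-reduced-counting"))

    val inputPath  = args(0)        // one transaction per line, items separated by whitespace
    val minSupport = args(1).toLong // absolute minimum support count

    // Cache the transactions so each Apriori iteration rescans memory, not disk.
    val transactions = sc.textFile(inputPath)
      .map(_.split("\\s+").filter(_.nonEmpty).map(_.toInt).toSet)
      .cache()

    // Fix one partitioner up front so later keyed stages can be co-partitioned
    // with this result instead of reshuffling the same pairs again.
    val partitioner = new HashPartitioner(sc.defaultParallelism)

    val frequentItems = transactions
      .flatMap(t => t.iterator.map(item => (item, 1L)))
      .reduceByKey(partitioner, _ + _)                 // map-side combine, then one small shuffle
      .filter { case (_, count) => count >= minSupport }

    frequentItems.collect().foreach { case (item, count) =>
      println(s"item $item -> $count")
    }

    sc.stop()
  }
}
```

Map-side combining means each node first sums its local (item, 1) pairs before anything crosses the network, and fixing a HashPartitioner on the result lets later keyed operations that use the same partitioner avoid redistributing these pairs; this is the kind of shuffle saving the abstract describes in general terms.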
Pages: 133 - 151
Number of pages: 18
Related papers
50 records in total
  • [21] CrossFIM: a spark-based hybrid frequent itemset mining algorithm for large datasets
    Shashi Raj
    Dharavath Ramesh
    Prabhakar Gantela
    Cluster Computing, 2025, 28 (4)
  • [22] Spark-Based Port and Net Scan Detection
    Affinito, Antonia
    Botta, Alessio
    Gallo, Luigi
    Garofalo, Mauro
    Ventre, Giorgio
    PROCEEDINGS OF THE 35TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING (SAC'20), 2020, : 1172 - 1179
  • [23] A data structure perspective to the RDD-based Apriori algorithm on Spark
    Singh P.
    Singh S.
    Mishra P.K.
    Garg R.
    International Journal of Information Technology, 2022, 14 (3) : 1585 - 1594
  • [24] Apriori algorithm optimization based on Spark platform under big data
    Yu, Huafeng
    MICROPROCESSORS AND MICROSYSTEMS, 2021, 80
  • [25] Spark-Based Parallel Method for Prediction of Events
    B. S. A. S. Rajita
    Yash Ranjan
    Chandekar Tanmay Umesh
    Subhrakanta Panda
    Arabian Journal for Science and Engineering, 2020, 45 : 3437 - 3453
  • [26] Leveraging spark-based machine learning algorithm for audience sentiment analysis in youtube content
    Subha, K.
    Bharathi, N.
    Intelligent Data Analysis, 2024, 28 (05) : 1395 - 1405
  • [27] Accelerating Spark-Based Applications with MPI and OpenACC
    Alshahrani, Saeed
    Al Shehri, Waleed
    Almalki, Jameel
    Alghamdi, Ahmed M.
    Alammari, Abdullah M.
    COMPLEXITY, 2021, 2021
  • [28] A Spark-based Ant Lion Algorithm for Parameters Optimization of Random Forest in Credit Classification
    Chen, Hongwei
    Chang, Pengyang
    Hu, Zhou
    Fu, Heng
    Yan, Lingyu
    PROCEEDINGS OF 2019 IEEE 3RD INFORMATION TECHNOLOGY, NETWORKING, ELECTRONIC AND AUTOMATION CONTROL CONFERENCE (ITNEC 2019), 2019, : 992 - 996
  • [29] HFIM: a Spark-based hybrid frequent itemset mining algorithm for big data processing
    Krishan Kumar Sethi
    Dharavath Ramesh
    The Journal of Supercomputing, 2017, 73 : 3652 - 3668
  • [30] Spark-based Parallel Cooperative Co-evolution Particle Swarm Optimization Algorithm
    Cao, Bin
    Li, Weiqiang
    Zhao, Jianwei
    Yang, Shan
    Kang, Xinyuan
    Ling, Yingbiao
    Lv, Zhihan
    2016 IEEE INTERNATIONAL CONFERENCE ON WEB SERVICES (ICWS), 2016, : 570 - 577