A Spark-based Apriori algorithm with reduced shuffle overhead

Cited by: 0
Authors
Shashi Raj
Dharavath Ramesh
Krishan Kumar Sethi
Affiliations
[1] Bakhtiyarpur College of Engineering, Department of Computer Science and Engineering
[2] Indian Institute of Technology (ISM), Department of Computer Science and Engineering
Source
Keywords
Apache Spark; Apriori algorithm; Large-scale datasets; Shuffle overhead;
DOI
Not available
Chinese Library Classification number
Subject classification code
Abstract
Mining frequent itemsets is considered a core activity for finding association rules in transactional datasets. Among the well-known approaches to finding frequent itemsets, the Apriori algorithm is the earliest proposed. Many attempts have been made to adapt the Apriori algorithm to large-scale datasets. However, the bottlenecks associated with Apriori, such as repeated scans of the input dataset and generation of all the candidate itemsets prior to counting their support values, reduce its effectiveness for large datasets. When the data size is large, even distributed and parallel implementations of Apriori using the MapReduce framework do not perform well. This is due to the iterative nature of the algorithm, which incurs high disk overhead: in each iteration, the input dataset, which resides on disk, is scanned, causing high disk I/O. Apache Spark implementations of Apriori show better performance due to Spark's in-memory processing capabilities. Spark makes iterative scanning of a dataset faster by keeping it in a memory abstraction called a resilient distributed dataset (RDD). An RDD keeps datasets in the form of key-value pairs spread across the cluster nodes. RDD operations require these key-value pairs to be redistributed among cluster nodes in the course of processing. This redistribution, or shuffle operation, incurs communication and synchronization overhead. In this manuscript, we propose a novel approach, namely the Spark-based Apriori algorithm with reduced shuffle overhead (SARSO). It utilizes the benefits of Spark's parallel and distributed computing environment and its in-memory processing capabilities. It improves efficiency further by reducing the shuffle overhead caused by RDD operations at each iteration.
In other words, it restricts the movement of key-value pairs across the cluster nodes by using a partitioning method and hence reduces the communication and synchronization overhead incurred by the Spark shuffle operation. Extensive experiments have been conducted to measure the performance of SARSO on benchmark datasets and to compare it with an existing algorithm. Experimental results show that SARSO has better performance in terms of running time and scalability.
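The two Apriori bottlenecks named in the abstract (one full scan of the dataset per iteration, and generation of all candidates before any support is counted) can be seen in a minimal single-machine sketch. This is plain Python for illustration only, not SARSO and not a Spark program; the function name `apriori`, the parameter `min_support` (an absolute count), and the returned scan counter are assumptions introduced here.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Level-wise Apriori sketch (hypothetical helper, not the paper's code).

    Returns (frequent, scans): a dict mapping each frequent itemset
    (as a frozenset) to its support count, and the number of full
    passes made over the transaction list.
    """
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count every single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    scans = 1
    k = 2

    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets, then prune
        # any candidate that has an infrequent (k-1)-subset.
        prev = set(frequent)
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in prev for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        if not candidates:
            break

        # Support counting: one more full scan of the dataset. This is the
        # per-iteration scan that disk-based frameworks pay for repeatedly.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        scans += 1

        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1

    return result, scans


if __name__ == "__main__":
    # Toy data, invented for this sketch.
    data = [
        {"bread", "milk"},
        {"bread", "diapers", "beer"},
        {"milk", "diapers", "beer"},
        {"bread", "milk", "diapers"},
        {"bread", "milk", "beer"},
    ]
    itemsets, passes = apriori(data, min_support=3)
    for itemset, support in sorted(itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), support)
    print("full scans:", passes)
```

In a distributed setting, the support-counting step is where per-itemset counts from different nodes must be combined, which is the kind of key-value redistribution the abstract describes as shuffle overhead.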
Pages: 133 - 151
Number of pages: 18
Related papers
50 in total
  • [1] A Spark-based Apriori algorithm with reduced shuffle overhead
    Raj, Shashi
    Ramesh, Dharavath
    Sethi, Krishan Kumar
    JOURNAL OF SUPERCOMPUTING, 2021, 77 (01): 133 - 151
  • [2] ASCF: Optimization of the Apriori Algorithm Using Spark-Based Cuckoo Filter Structure
    Alrahwan, Bana Ahmad
    Farouk, Mona
    INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, 2024, 2024
  • [3] A Utility-Based Distributed Pattern Mining Algorithm With Reduced Shuffle Overhead
    Kumar, Sunil
    Mohbey, Krishna Kumar
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2023, 34 (01) : 416 - 428
  • [4] Spark-Based Scalable Algorithm for Link Prediction
    Saketh, K.
    Rajeswari, N. Raja
    Keerthana, M. Krishna
    Shaik, Fathimabi
    INNOVATIVE DATA COMMUNICATION TECHNOLOGIES AND APPLICATION, ICIDCA 2021, 2022, 96 : 619 - 635
  • [5] R-Apriori: An Efficient Apriori based Algorithm on Spark
    Rathee, Sanjay
    Kaul, Manohar
    Kashyap, Arti
    PIKM'15: PROCEEDINGS OF THE 8TH PH.D. WORKSHOP IN INFORMATION AND KNOWLEDGE MANAGEMENT, 2015, : 27 - 34
  • [6] Spark-based parallel processing whale optimization algorithm
    Alshayeji, Mohammad
    Behbehani, Bader
    Ahmad, Imtiaz
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2022, 34 (04):
  • [7] A Spark-based Incremental Algorithm for Frequent Itemset Mining
    Wen, Haoxing
    Li, Xiaoguang
    Kou, Mingdong
    Tou, Huaixiao
    He, Hengyi
    Yang, Yulu
    BDIOT 2018: PROCEEDINGS OF THE 2018 2ND INTERNATIONAL CONFERENCE ON BIG DATA AND INTERNET OF THINGS, 2018, : 53 - 58
  • [8] A Spark-Based Parallel Implementation of Arithmetic Optimization Algorithm
    AlJame, Maryam
    Alnoori, Aisha
    Alfailakawi, Mohammad G.
    Ahmad, Imtiaz
    INTERNATIONAL JOURNAL OF APPLIED METAHEURISTIC COMPUTING, 2023, 14 (01)
  • [9] Spark-based Parallel Collaborative Filtering Recommendation Algorithm
    Yang, Yongli
    Xue, Fei
    Cai, Yongquan
    Ning, Zhenhu
    PROCEEDINGS OF THE 2ND INTERNATIONAL CONFERENCE ON COMPUTER ENGINEERING, INFORMATION SCIENCE & APPLICATION TECHNOLOGY (ICCIA 2017), 2017, 74 : 987 - 990
  • [10] Spark-based Feature Selection Algorithm of Network Traffic Classification
    Ke, Wenlong
    Wang, Yong
    Lei, Xiaochun
    Wei, Bizhong
    2017 13TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND SECURITY (CIS), 2017, : 140 - 144