A scalable and flexible basket analysis system for big transaction data in Spark

Cited by: 5
Authors
Sun, Xudong [1, 2]
Ngueilbaye, Alladoumbaye [1, 2]
Luo, Kaijing [1, 2]
Cai, Yongda [1, 2]
Wu, Dingming [1, 2]
Huang, Joshua Zhexue [1, 2, 3]
Affiliations
[1] Shenzhen Univ, Natl Engn Lab Big Data Syst Comp Technol, Shenzhen 518060, Peoples R China
[2] Shenzhen Univ, Big Data Inst, Coll Comp Sci & Software Engn, Shenzhen 518060, Peoples R China
[3] Guangdong Lab Artificial Intelligence & Digital Ec, Shenzhen 518107, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Big transaction data; Frequent itemset mining; Parallel and distributed computing; Business basket analysis; Basket analysis systems; FP-GROWTH; FREQUENT; ALGORITHM; PATTERNS;
DOI
10.1016/j.ipm.2023.103577
Chinese Library Classification
TP [Automation technology, computer technology];
Subject classification code
0812 ;
Abstract
Basket analysis is a widely used technique that helps retailers uncover patterns and associations among products sold in customer shopping transactions. However, as transaction databases grow, traditional basket analysis techniques and systems become less effective because of two issues that arise in big data applications: data scalability and the flexibility to adapt to different application tasks. This paper proposes a scalable distributed frequent itemset mining (ScaDistFIM) algorithm for basket analysis on big transaction data to address both problems. ScaDistFIM runs in two stages. The first stage uses the FP-Growth algorithm to compute the local frequent itemsets of each random subset of the distributed transaction dataset, with all random subsets processed in parallel. The second stage uses an approximation method to aggregate all local frequent itemsets into a final approximate set of frequent itemsets, for which the support values are estimated. We further describe the implementation of the ScaDistFIM algorithm and of a flexible basket analysis system that uses Spark SQL queries, demonstrating the system's flexibility in real applications. Experimental results on synthetic and real-world transaction datasets show that, compared to the Spark FP-Growth algorithm, ScaDistFIM achieves time savings of at least 90% while maintaining nearly 100% accuracy, and therefore scales better. On the GenD dataset with 1 billion records, ScaDistFIM requires only 360 s to achieve 100% precision and recall, whereas Spark FP-Growth cannot complete the computation because of memory limitations.
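The two-stage idea in the abstract can be illustrated with a minimal single-machine sketch in plain Python. Several parts are assumptions made only for illustration and are not taken from the paper: brute-force itemset enumeration stands in for FP-Growth, the subsets are processed sequentially rather than in parallel on Spark, and the aggregation rule (keep an itemset that is locally frequent in a given fraction of subsets and estimate its support as the mean of its local supports) is a hypothetical stand-in for the paper's approximation method. All function and parameter names are likewise hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def local_frequent_itemsets(transactions, min_support, max_size=3):
    """Stage 1 (per subset): return {itemset: relative support}.

    Assumption: brute-force enumeration replaces FP-Growth to keep the
    sketch dependency-free; it is only practical for tiny examples.
    """
    if not transactions:
        return {}
    counts = defaultdict(int)
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, min(max_size, len(items)) + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    total = len(transactions)
    return {s: c / total for s, c in counts.items() if c / total >= min_support}


def aggregate(local_results, min_vote_fraction=0.5):
    """Stage 2: merge local results into one approximate result.

    Assumption (not the paper's rule): keep itemsets that are locally
    frequent in at least `min_vote_fraction` of the subsets, and estimate
    support as the mean of the local supports where the itemset appeared.
    """
    supports = defaultdict(list)
    for result in local_results:
        for itemset, support in result.items():
            supports[itemset].append(support)
    needed = min_vote_fraction * len(local_results)
    return {s: sum(v) / len(v) for s, v in supports.items() if len(v) >= needed}


def approximate_frequent_itemsets(transactions, num_subsets, min_support, seed=0):
    """Split the data into random subsets, mine each one, then aggregate."""
    shuffled = list(transactions)
    random.Random(seed).shuffle(shuffled)
    # Round-robin slicing of the shuffled list yields random, disjoint subsets.
    subsets = [shuffled[i::num_subsets] for i in range(num_subsets)]
    local_results = [local_frequent_itemsets(s, min_support) for s in subsets]
    return aggregate(local_results)
```

Calling `approximate_frequent_itemsets(baskets, num_subsets=4, min_support=0.3)` on a list of item lists returns a dictionary that maps each retained itemset (a sorted tuple of items) to its estimated support. In a distributed setting, the per-subset mining step is the part that would run in parallel.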
Pages: 22