A scalable and flexible basket analysis system for big transaction data in Spark

Cited by: 5
Authors
Sun, Xudong [1, 2]
Ngueilbaye, Alladoumbaye [1, 2]
Luo, Kaijing [1, 2]
Cai, Yongda [1, 2]
Wu, Dingming [1, 2]
Huang, Joshua Zhexue [1, 2, 3]
Affiliations
[1] Shenzhen Univ, Natl Engn Lab Big Data Syst Comp Technol, Shenzhen 518060, Peoples R China
[2] Shenzhen Univ, Big Data Inst, Coll Comp Sci & Software Engn, Shenzhen 518060, Peoples R China
[3] Guangdong Lab Artificial Intelligence & Digital Ec, Shenzhen 518107, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Big transaction data; Frequent itemset mining; Parallel and distributed computing; Business basket analysis; Basket analysis systems; FP-GROWTH; FREQUENT; ALGORITHM; PATTERNS;
DOI
10.1016/j.ipm.2023.103577
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Basket analysis is a prevailing technique that helps retailers uncover patterns and associations among products sold in customer shopping transactions. However, as transaction databases grow, traditional basket analysis techniques and systems become less effective because of two issues in big data applications: data scalability and the flexibility to adapt to different application tasks. This paper proposes a scalable distributed frequent itemset mining (ScaDistFIM) algorithm for basket analysis on big transaction data to address both problems. ScaDistFIM runs in two stages. The first stage uses the FP-Growth algorithm to compute the local frequent itemsets of each random subset of the distributed transaction dataset, with all random subsets processed in parallel. The second stage uses an approximation method to aggregate all local frequent itemsets into a final approximate set of frequent itemsets, whose support values are estimated. We further describe the implementation of the ScaDistFIM algorithm and of a flexible basket analysis system that uses Spark SQL queries, to demonstrate the system's flexibility in real applications. Experimental results on synthetic and real-world transaction datasets demonstrate that, compared to the Spark FP-Growth algorithm, the ScaDistFIM algorithm achieves time savings of at least 90% while maintaining nearly 100% accuracy. Hence, the ScaDistFIM algorithm exhibits superior scalability. On the GenD dataset with 1 billion records, the ScaDistFIM algorithm requires only 360 s to achieve 100% precision and recall. In contrast, due to memory limitations, Spark FP-Growth cannot complete the computation task.
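The two-stage idea in the abstract can be illustrated with a small single-machine sketch. It is not the authors' implementation: the abstract does not specify the approximation method, so the aggregation rule here (keep an itemset that is locally frequent in a strict majority of subsets, and estimate its support as the mean of those local supports) is an assumption made for illustration. A brute-force counter stands in for FP-Growth, the subsets are mined sequentially rather than in parallel, and all function and parameter names are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def local_frequent_itemsets(transactions, min_support, max_size=3):
    """Mine one subset. Brute-force stand-in for FP-Growth (assumption):
    counts every itemset of up to max_size items and keeps the frequent ones."""
    counts = defaultdict(int)
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    total = len(transactions)
    return {s: c / total for s, c in counts.items() if c / total >= min_support}


def approximate_frequent_itemsets(transactions, num_subsets, min_support, seed=0):
    """Two-stage sketch: local mining on random subsets, then aggregation."""
    # Stage 1: split the data into random subsets and mine each one on its own.
    # A distributed system would run these in parallel; here they run in a loop.
    rng = random.Random(seed)
    shuffled = list(transactions)
    rng.shuffle(shuffled)
    subsets = [shuffled[i::num_subsets] for i in range(num_subsets)]
    local_results = [local_frequent_itemsets(s, min_support) for s in subsets]

    # Stage 2 (assumed rule, not taken from the paper): keep itemsets that are
    # locally frequent in a strict majority of subsets and estimate the global
    # support as the mean of the local supports in which they were found.
    found = defaultdict(list)
    for result in local_results:
        for itemset, support in result.items():
            found[itemset].append(support)
    return {
        itemset: sum(supports) / len(supports)
        for itemset, supports in found.items()
        if len(supports) > num_subsets / 2
    }


if __name__ == "__main__":
    data = [["bread", "milk"]] * 70 + [["bread"]] * 10 + [["eggs"]] * 20
    result = approximate_frequent_itemsets(data, num_subsets=4, min_support=0.3)
    for itemset, support in sorted(result.items(), key=lambda kv: -kv[1]):
        print(sorted(itemset), round(support, 3))
```

In a Spark setting, the first stage would typically map over partitions of the distributed dataset, and only the much smaller local results would be collected for aggregation.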
Pages: 22
Related papers
50 in total
  • [31] Scalable Mining of Big Data
    Leung, Carson K.
    Pazdor, Adam G. M.
    Zheng, Hao
    2021 IEEE SMARTWORLD, UBIQUITOUS INTELLIGENCE & COMPUTING, ADVANCED & TRUSTED COMPUTING, SCALABLE COMPUTING & COMMUNICATIONS, INTERNET OF PEOPLE, AND SMART CITY INNOVATIONS (SMARTWORLD/SCALCOM/UIC/ATC/IOP/SCI 2021), 2021, : 240 - 247
  • [32] Application of Market Basket Analysis for the Visualization of Transaction Data Based on Human Lifestyle and Spectroscopic Measurements
    Shiokawa, Yuka
    Misawa, Takuma
    Date, Yasuhiro
    Kikuchi, Jun
    ANALYTICAL CHEMISTRY, 2016, 88 (05) : 2714 - 2719
  • [33] Architecture and Implementation of a Scalable Sensor Data Storage and Analysis System Using Cloud Computing and Big Data Technologies
    Aydin, Galip
    Hallac, Ibrahim Riza
    Karakus, Betul
    JOURNAL OF SENSORS, 2015, 2015
  • [34] Efficient and flexible anonymization of transaction data
    Loukides, Grigorios
    Gkoulalas-Divanis, Aris
    Shao, Jianhua
    KNOWLEDGE AND INFORMATION SYSTEMS, 2013, 36 (01) : 153 - 210
  • [35] Efficient and flexible anonymization of transaction data
    Grigorios Loukides
    Aris Gkoulalas-Divanis
    Jianhua Shao
    Knowledge and Information Systems, 2013, 36 : 153 - 210
  • [36] Diabetes Data Prediction Using Spark and Analysis in Hue Over Big Data
    Guttikonda, Geetha
    Katamaneni, Madhavi
    Pandala, MadhaviLatha
    PROCEEDINGS OF THE 2019 3RD INTERNATIONAL CONFERENCE ON COMPUTING METHODOLOGIES AND COMMUNICATION (ICCMC 2019), 2019, : 1112 - 1117
  • [37] CDFRS: A scalable sampling approach for efficient big data analysis
    Cai, Yongda
    Wu, Dingming
    Sun, Xudong
    Wu, Siyue
    Xu, Jingsheng
    Huang, Joshua Zhexue
    INFORMATION PROCESSING & MANAGEMENT, 2024, 61 (04)
  • [38] A Scalable Computing Resources System for Remote Sensing Big Data Processing Using GeoPySpark Based on Spark on K8s
    Guo, Jifu
    Huang, Chunlin
    Hou, Jinliang
    REMOTE SENSING, 2022, 14 (03)
  • [39] Scalable Data Analytics Market Basket Model for Transactional Data Streams
    Izang, Aaron A.
    Goga, Nicolae
    Kuyoro, Shade O.
    Alao, Olujimi D.
    Omotunde, Ayokunle A.
    Adio, Adesina K.
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2019, 10 (10) : 61 - 68
  • [40] THE FLEXIBLE EXCHANGE BASKET - A MACROECONOMIC ANALYSIS
    BHANDARI, JS
    JOURNAL OF INTERNATIONAL MONEY AND FINANCE, 1985, 4 (01) : 19 - 41