FiDoop-DP: Data Partitioning in Frequent Itemset Mining on Hadoop Clusters

Cited by: 41
Authors
Xun, Yaling [1]
Zhang, Jifu [1]
Qin, Xiao [2]
Zhao, Xujun [1]
Affiliations
[1] Taiyuan Univ Sci & Technol, Taiyuan 030024, Shanxi, Peoples R China
[2] Auburn Univ, Dept Comp Sci & Software Engn, Samuel Ginn Coll Engn, Auburn, AL 36849 USA
Funding
US National Science Foundation
Keywords
Frequent itemset mining; parallel data mining; data partitioning; MapReduce programming model; Hadoop cluster; MAPREDUCE; PARALLEL
DOI
10.1109/TPDS.2016.2560176
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Traditional parallel algorithms for mining frequent itemsets aim to balance load by equally partitioning data among a group of computing nodes. We start this study by discovering a serious performance problem of the existing parallel Frequent Itemset Mining algorithms. Given a large dataset, data partitioning strategies in the existing solutions suffer from high communication and mining overhead induced by redundant transactions transmitted among computing nodes. We address this problem by developing a data partitioning approach called FiDoop-DP using the MapReduce programming model. The overarching goal of FiDoop-DP is to boost the performance of parallel Frequent Itemset Mining on Hadoop clusters. At the heart of FiDoop-DP is a Voronoi diagram-based data partitioning technique, which exploits correlations among transactions. Incorporating a similarity metric and the Locality-Sensitive Hashing technique, FiDoop-DP places highly similar transactions into the same data partition to improve locality without creating an excessive number of redundant transactions. We implement FiDoop-DP on a 24-node Hadoop cluster, driven by a wide range of datasets created by the IBM Quest Market-Basket Synthetic Data Generator. Experimental results reveal that FiDoop-DP reduces network and computing loads by eliminating redundant transactions on Hadoop nodes. FiDoop-DP improves the performance of the existing parallel frequent-pattern scheme by up to 31 percent, with an average of 18 percent.
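The partitioning idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not name the similarity metric, the pivot-selection method, or the hashing scheme, so the Jaccard similarity, the hand-picked pivots, and the MD5-based MinHash signature below are all assumptions chosen for illustration. The function names are hypothetical as well.

```python
import hashlib


def jaccard(a, b):
    """Jaccard similarity of two item sets (assumed metric, not from the paper)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def assign_to_pivots(transactions, pivots):
    """Voronoi-style partitioning: each transaction joins its most similar pivot.

    Ties go to the pivot with the lowest index.
    """
    partitions = {i: [] for i in range(len(pivots))}
    for t in transactions:
        best = max(range(len(pivots)), key=lambda i: jaccard(t, pivots[i]))
        partitions[best].append(t)
    return partitions


def minhash_signature(items, num_hashes=8):
    """MinHash signature of a non-empty item set, one common LSH family
    for Jaccard similarity (assumed here, not confirmed by the abstract)."""
    items = list(items)
    if not items:
        raise ValueError("items must be non-empty")
    signature = []
    for seed in range(num_hashes):
        # Seeded hash of every item; keep the minimum value per seed.
        signature.append(
            min(
                int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)
                for item in items
            )
        )
    return tuple(signature)


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"beer", "chips"},
    {"beer", "chips", "salsa"},
]
pivots = [{"bread", "milk"}, {"beer", "chips"}]  # hand-picked for the example
partitions = assign_to_pivots(transactions, pivots)
```

In this toy run the two grocery-style transactions land in partition 0 and the two snack-style transactions in partition 1, so transactions that share items are mined together. In a real MapReduce job the signature comparison would stand in for exact pairwise similarity to keep the grouping step cheap on large datasets.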
Pages: 101-114
Number of pages: 14