FiDoop-DP: Data Partitioning in Frequent Itemset Mining on Hadoop Clusters

Cited by: 41
Authors
Xun, Yaling [1]
Zhang, Jifu [1]
Qin, Xiao [2]
Zhao, Xujun [1]
Affiliations
[1] Taiyuan Univ Sci & Technol, Taiyuan 030024, Shanxi, Peoples R China
[2] Auburn Univ, Dept Comp Sci & Software Engn, Samuel Ginn Coll Engn, Auburn, AL 36849 USA
Funding
US National Science Foundation;
Keywords
Frequent itemset mining; parallel data mining; data partitioning; MapReduce programming model; Hadoop cluster; MAPREDUCE; PARALLEL;
DOI
10.1109/TPDS.2016.2560176
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Traditional parallel algorithms for mining frequent itemsets aim to balance load by partitioning data equally among a group of computing nodes. We start this study by identifying a serious performance problem in existing parallel Frequent Itemset Mining algorithms. Given a large dataset, the data partitioning strategies in existing solutions suffer from high communication and mining overhead induced by redundant transactions transmitted among computing nodes. We address this problem by developing a data partitioning approach called FiDoop-DP using the MapReduce programming model. The overarching goal of FiDoop-DP is to boost the performance of parallel Frequent Itemset Mining on Hadoop clusters. At the heart of FiDoop-DP is a Voronoi diagram-based data partitioning technique, which exploits correlations among transactions. Incorporating a similarity metric and the Locality-Sensitive Hashing technique, FiDoop-DP places highly similar transactions into the same data partition to improve locality without creating an excessive number of redundant transactions. We implement FiDoop-DP on a 24-node Hadoop cluster, driven by a wide range of datasets created by the IBM Quest Market-Basket Synthetic Data Generator. Experimental results reveal that FiDoop-DP reduces network and computing loads by eliminating redundant transactions on Hadoop nodes. FiDoop-DP improves the performance of the existing parallel frequent-pattern scheme by up to 31 percent, with an average of 18 percent.
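The partitioning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes Jaccard distance as the similarity metric and a hand-picked list of pivot transactions, neither of which the abstract specifies. It also omits the Locality-Sensitive Hashing step and the MapReduce plumbing. The names `jaccard_distance` and `voronoi_partition` are hypothetical.

```python
from collections import defaultdict


def jaccard_distance(a, b):
    """Return 1 - |a ∩ b| / |a ∪ b| for two collections of items."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty transactions are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


def voronoi_partition(transactions, pivots):
    """Assign each transaction to the index of its nearest pivot.

    Each pivot defines one Voronoi cell under Jaccard distance
    (an assumed metric); ties go to the lowest pivot index.
    """
    partitions = defaultdict(list)
    for t in transactions:
        nearest = min(
            range(len(pivots)),
            key=lambda i: jaccard_distance(t, pivots[i]),
        )
        partitions[nearest].append(t)
    return dict(partitions)


# Illustrative data only; not taken from the paper's experiments.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"beer", "chips"},
    {"beer", "chips", "salsa"},
]
pivots = [{"bread", "milk"}, {"beer", "chips"}]
parts = voronoi_partition(transactions, pivots)
```

In this toy input the two bread-and-milk transactions land in partition 0 and the two beer-and-chips transactions in partition 1, so similar transactions are mined on the same node rather than being replicated across nodes.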
Pages: 101-114
Page count: 14
Related papers
50 in total
  • [21] Iterative sampling based frequent itemset mining for big data
    Wu, Xian
    Fan, Wei
    Peng, Jing
    Zhang, Kun
    Yu, Yong
    INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2015, 6 (06) : 875 - 882
  • [22] AnyFI: An Anytime Frequent Itemset Mining Algorithm for Data Streams
    Goyal, Poonam
    Challa, Jagat Sesh
    Shrivastava, Shivin
    Goyal, Navneet
    2017 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2017, : 942 - 947
  • [23] Probabilistic frequent itemset mining over uncertain data streams
    Li, Haifeng
    Zhang, Ning
    Zhu, Jianming
    Wang, Yue
    Cao, Huaihu
    EXPERT SYSTEMS WITH APPLICATIONS, 2018, 112 : 274 - 287
  • [24] Efficient Frequent Itemset Mining from Dense Data Streams
    Cuzzocrea, Alfredo
    Jiang, Fan
    Lee, Wookey
    Leung, Carson K.
    WEB TECHNOLOGIES AND APPLICATIONS, APWEB 2014, 2014, 8709 : 593 - 601
  • [25] Constrained Frequent Itemset Mining from Uncertain Data Streams
    Leung, Carson Kai-Sang
    Hao, Boyu
    Jiang, Fan
    2010 IEEE 26TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING WORKSHOPS (ICDE 2010), 2010, : 120 - 127
  • [26] MrFIM: A MapReduce Approach for Frequent Itemset Mining in Big Data
    Rahman, Abdul
    Manjaramkar, Arati
    2018 4TH INTERNATIONAL CONFERENCE FOR CONVERGENCE IN TECHNOLOGY (I2CT), 2018,
  • [27] An algorithm for in-core frequent itemset mining on streaming data
    Jin, RM
    Agrawal, G
    FIFTH IEEE INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS, 2005, : 210 - 217
  • [28] A Review on Frequent Itemset Mining Algorithms in Social Network Data
    Dharsandiya, Ankit N.
    Patel, Mihir R.
    PROCEEDINGS OF THE 2016 IEEE INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS, SIGNAL PROCESSING AND NETWORKING (WISPNET), 2016, : 1046 - 1048
  • [29] Fast Algorithms for Frequent Itemset Mining from Uncertain Data
    Leung, Carson Kai-Sang
    MacKinnon, Richard Kyle
    Tanbeer, Syed K.
    2014 IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM), 2014, : 893 - 898
  • [30] Spatio-Temporal Frequent Itemset Mining on Web Data
    Aggarwal, Apeksha
    Toshniwal, Durga
    2018 18TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS (ICDMW), 2018, : 1160 - 1165