FiDoop-DP: Data Partitioning in Frequent Itemset Mining on Hadoop Clusters

Cited by: 41

Authors:

Xun, Yaling [1]

Zhang, Jifu [1]

Qin, Xiao [2]

Zhao, Xujun [1]

Affiliations:

[1] Taiyuan Univ Sci & Technol, Taiyuan 030024, Shanxi, Peoples R China

[2] Auburn Univ, Dept Comp Sci & Software Engn, Samuel Ginn Coll Engn, Auburn, AL 36849 USA

Source:

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS | 2017, Vol. 28, No. 01

Funding:

U.S. National Science Foundation;

Keywords:

Frequent itemset mining; parallel data mining; data partitioning; MapReduce programming model; Hadoop cluster; MAPREDUCE; PARALLEL

DOI:

10.1109/TPDS.2016.2560176

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202

Abstract:

Traditional parallel algorithms for mining frequent itemsets aim to balance load by equally partitioning data among a group of computing nodes. We start this study by discovering a serious performance problem of the existing parallel Frequent Itemset Mining algorithms. Given a large dataset, data partitioning strategies in the existing solutions suffer from high communication and mining overhead induced by redundant transactions transmitted among computing nodes. We address this problem by developing a data partitioning approach called FiDoop-DP using the MapReduce programming model. The overarching goal of FiDoop-DP is to boost the performance of parallel Frequent Itemset Mining on Hadoop clusters. At the heart of FiDoop-DP is the Voronoi diagram-based data partitioning technique, which exploits correlations among transactions. Incorporating the similarity metric and the Locality-Sensitive Hashing technique, FiDoop-DP places highly similar transactions into a data partition to improve locality without creating an excessive number of redundant transactions. We implement FiDoop-DP on a 24-node Hadoop cluster, driven by a wide range of datasets created by the IBM Quest Market-Basket Synthetic Data Generator. Experimental results reveal that FiDoop-DP is conducive to reducing network and computing loads by virtue of eliminating redundant transactions on Hadoop nodes. FiDoop-DP significantly improves the performance of the existing parallel frequent-pattern scheme by up to 31 percent with an average of 18 percent.
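
The partitioning idea in the abstract can be illustrated with a minimal, single-machine sketch. It is not the authors' implementation: it assumes Jaccard distance as the similarity metric (the abstract does not name one), uses hand-picked pivot transactions, and leaves out both Locality-Sensitive Hashing and MapReduce. The names `jaccard_distance` and `voronoi_partition` are hypothetical. Each transaction is assigned to its nearest pivot, which is the defining property of a Voronoi cell.

```python
def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def voronoi_partition(transactions, pivots):
    """Assign each transaction to the partition of its nearest pivot.

    Assumption: pivots are supplied by the caller; how they are chosen
    is outside the scope of this sketch.
    """
    partitions = [[] for _ in pivots]
    for transaction in transactions:
        items = frozenset(transaction)
        # Ties go to the lowest pivot index because min() keeps the first minimum.
        nearest = min(
            range(len(pivots)),
            key=lambda i: jaccard_distance(items, pivots[i]),
        )
        partitions[nearest].append(transaction)
    return partitions


# Toy market-basket data (invented for illustration only).
transactions = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["beer", "chips"],
    ["beer", "chips", "salsa"],
]
pivots = [frozenset(["bread", "milk"]), frozenset(["beer", "chips"])]
partitions = voronoi_partition(transactions, pivots)
```

In this toy run the two bread-and-milk baskets land in the first partition and the two beer-and-chips baskets in the second, so similar transactions stay together and neither basket needs to be copied to both partitions.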


Pages: 101-114

Page count: 14
