Mining top-K frequent itemsets through progressive sampling

Cited by: 28

Authors:

Pietracaprina, Andrea [2]

Riondato, Matteo [1]

Upfal, Eli [1]

Vandin, Fabio [1]

Affiliations:

[1] Brown Univ, Dept Comp Sci, Providence, RI 02912 USA

[2] Univ Padua, Dipartimento Ingn Informaz, Padua, Italy

Source:

DATA MINING AND KNOWLEDGE DISCOVERY | 2010, Vol. 21, Issue 2

Funding:

US National Science Foundation;

Keywords:

Sampling; Top-K frequent itemsets; Frequent itemsets mining; Bloom filters; Progressive sampling;

DOI:

10.1007/s10618-010-0185-7

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We study the use of sampling for efficiently mining the top-K frequent itemsets of cardinality at most w. To this end, we define an approximation to the top-K frequent itemsets to be a family of itemsets which includes (resp., excludes) all very frequent (resp., very infrequent) itemsets, together with an estimate of these itemsets' frequencies with a bounded error. Our first result is an upper bound on the sample size which guarantees that the top-K frequent itemsets mined from a random sample of that size approximate the actual top-K frequent itemsets, with probability larger than a specified value. We show that the upper bound is asymptotically tight when w is constant. Our main algorithmic contribution is a progressive sampling approach, combined with suitable stopping conditions, which on appropriate inputs is able to extract approximate top-K frequent itemsets from samples whose sizes are smaller than the general upper bound. In order to test the stopping conditions, this approach maintains the frequency of all itemsets encountered, which is practical only for small w. However, we show how this problem can be mitigated by using a variation of Bloom filters. A number of experiments conducted on both synthetic and real benchmark datasets show that samples substantially smaller than the original dataset (i.e., of size defined by the upper bound or reached through the progressive sampling approach) make it possible to approximate the actual top-K frequent itemsets with accuracy much higher than what is analytically proved.
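The general shape of a progressive sampling loop, as described in the abstract, can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' algorithm: the abstract does not specify the stopping conditions, so the sketch substitutes a simple stability check (the top-K family is unchanged and its frequency estimates move by at most `tol` between rounds). The function names, the doubling schedule, and the parameters `start`, `max_size`, and `tol` are all hypothetical. Like the approach in the abstract, it keeps a count for every itemset of cardinality at most w that it encounters, which is practical only for small w.

```python
import random
from collections import Counter
from itertools import combinations


def itemset_counts(transactions, w):
    """Count every itemset of cardinality 1..w occurring in the transactions."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for size in range(1, w + 1):
            counts.update(combinations(items, size))
    return counts


def top_k(counts, n, k):
    """Return the k most frequent itemsets with their frequency in n transactions."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(itemset, c / n) for itemset, c in ranked[:k]]


def progressive_top_k(dataset, k, w, start=100, max_size=None, tol=0.01, seed=0):
    """Grow a random sample until the top-K estimate looks stable.

    Assumption: the stability test below is a stand-in stopping rule chosen
    for illustration; it carries no probabilistic guarantee.
    """
    rng = random.Random(seed)
    max_size = max_size or len(dataset)
    size = min(start, max_size)
    # Sample transactions uniformly at random, with replacement.
    sample = [rng.choice(dataset) for _ in range(size)]
    counts = itemset_counts(sample, w)
    prev = top_k(counts, size, k)
    while size < max_size:
        # Double the sample (capped at max_size) and update counts incrementally.
        extra = min(size, max_size - size)
        new = [rng.choice(dataset) for _ in range(extra)]
        counts.update(itemset_counts(new, w))
        size += extra
        cur = top_k(counts, size, k)
        prev_freq = dict(prev)
        stable = set(prev_freq) == {s for s, _ in cur} and all(
            abs(f - prev_freq[s]) <= tol for s, f in cur
        )
        prev = cur
        if stable:
            break
    return prev, size
```

Calling `progressive_top_k(transactions, k=10, w=2)` on a list of transactions (each a list of items) returns the estimated top-K itemsets with their sample frequencies, together with the sample size at which the loop stopped.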

Pages: 310 - 326

Page count: 17
