Mining top-K frequent itemsets through progressive sampling

Cited by: 28
Authors
Pietracaprina, Andrea [2]
Riondato, Matteo [1]
Upfal, Eli [1]
Vandin, Fabio [1]
Affiliations
[1] Brown Univ, Dept Comp Sci, Providence, RI 02912 USA
[2] Univ Padua, Dipartimento Ingn Informaz, Padua, Italy
Funding
National Science Foundation (USA);
Keywords
Sampling; Top-K frequent itemsets; Frequent itemsets mining; Bloom filters; Progressive sampling;
DOI
10.1007/s10618-010-0185-7
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
We study the use of sampling for efficiently mining the top-K frequent itemsets of cardinality at most w. To this end, we define an approximation to the top-K frequent itemsets to be a family of itemsets which includes (resp., excludes) all very frequent (resp., very infrequent) itemsets, together with an estimate of these itemsets' frequencies with a bounded error. Our first result is an upper bound on the sample size which guarantees that the top-K frequent itemsets mined from a random sample of that size approximate the actual top-K frequent itemsets, with probability larger than a specified value. We show that the upper bound is asymptotically tight when w is constant. Our main algorithmic contribution is a progressive sampling approach, combined with suitable stopping conditions, which on appropriate inputs is able to extract approximate top-K frequent itemsets from samples whose sizes are smaller than the general upper bound. In order to test the stopping conditions, this approach maintains the frequency of all itemsets encountered, which is practical only for small w. However, we show how this problem can be mitigated by using a variation of Bloom filters. A number of experiments conducted on both synthetic and real benchmark datasets show that using samples substantially smaller than the original dataset (i.e., of size defined by the upper bound or reached through the progressive sampling approach) makes it possible to approximate the actual top-K frequent itemsets with accuracy much higher than what is analytically proved.
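As a rough illustration of the setting described in the abstract, the following minimal sketch mines the K most frequent itemsets of cardinality at most w from a uniform random sample of transactions and reports their estimated frequencies. It is not the authors' algorithm: the function names (`top_k_itemsets`, `sample_top_k`), sampling with replacement, the fixed `sample_size`, and the lexicographic tie-breaking are all assumptions made for this example, and the paper's sample-size bound, stopping conditions, and Bloom-filter variant are not reproduced here.

```python
import random
from collections import Counter
from itertools import combinations


def top_k_itemsets(transactions, k, w):
    """Return the k most frequent itemsets of cardinality at most w,
    each paired with its frequency (fraction of transactions containing it)."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))  # canonical order, duplicates removed
        for size in range(1, w + 1):
            counts.update(combinations(items, size))
    n = len(transactions)
    # Sort by decreasing count; ties are broken lexicographically
    # (a simplification chosen here for determinism).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(itemset, c / n) for itemset, c in ranked[:k]]


def sample_top_k(transactions, k, w, sample_size, seed=0):
    """Hypothetical helper: mine the top-k itemsets from a random sample.
    Transactions are drawn uniformly with replacement (an assumption)."""
    rng = random.Random(seed)
    sample = [rng.choice(transactions) for _ in range(sample_size)]
    return top_k_itemsets(sample, k, w)


dataset = [["a", "b"], ["a", "b"], ["a", "c"], ["b"]]
exact = top_k_itemsets(dataset, k=2, w=2)
approx = sample_top_k(dataset, k=2, w=2, sample_size=50)
```

Here `exact` holds the true top-2 itemsets of the toy dataset, while `approx` holds estimates from a sample; comparing the two frequency lists gives a feel for the estimation error that the paper bounds analytically.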
Pages: 310-326
Number of pages: 17
Related papers
50 in total
  • [1] Mining top-K frequent itemsets through progressive sampling
    Andrea Pietracaprina
    Matteo Riondato
    Eli Upfal
    Fabio Vandin
    [J]. Data Mining and Knowledge Discovery, 2010, 21 : 310 - 326
  • [2] Efficient algorithms of mining top-k frequent closed itemsets
    Lan Yongjie
    Qiu Yong
    [J]. ICEMI 2007: PROCEEDINGS OF 2007 8TH INTERNATIONAL CONFERENCE ON ELECTRONIC MEASUREMENT & INSTRUMENTS, VOL II, 2007, : 551 - 554
  • [3] Mining top-K frequent itemsets from data streams
    Wong, Raymond Chi-Wing
    Fu, Ada Wai-Chee
    [J]. DATA MINING AND KNOWLEDGE DISCOVERY, 2006, 13 (02) : 193 - 217
  • [4] Efficient incremental mining of top-K frequent closed itemsets
    Pietracaprina, Andrea
    Vandin, Fabio
    [J]. DISCOVERY SCIENCE, PROCEEDINGS, 2007, 4755 : 275 - +
  • [5] Mining top-K frequent itemsets from data streams
    Raymond Chi-Wing Wong
    Ada Wai-Chee Fu
    [J]. Data Mining and Knowledge Discovery, 2006, 13 : 193 - 217
  • [6] TFP: An efficient algorithm for mining top-K frequent closed itemsets
    Wang, JY
    Han, JW
    Lu, Y
    Tzvetkov, P
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2005, 17 (05) : 652 - 664
  • [7] Mining Frequent Itemsets through Progressive Sampling with Rademacher Averages
    Riondato, Matteo
    Upfal, Eli
    [J]. KDD'15: PROCEEDINGS OF THE 21ST ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2015, : 1005 - 1014
  • [8] Interactive mining of top-K frequent closed itemsets from data streams
    Li, Hua-Fu
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2009, 36 (07) : 10779 - 10788
  • [9] Parallel mining of top-k frequent itemsets in very large text database
    Wang, YH
    Jia, Y
    Yang, SQ
    [J]. ADVANCES IN WEB-AGE INFORMATION MANAGEMENT, PROCEEDINGS, 2005, 3739 : 706 - 712
  • [10] Using Bloom Filters for Mining Top-k Frequent Itemsets in Data Streams
    Kim, Younghee
    Cho, Kyungsoo
    Yoon, Jaeyeol
    Kim, Ieejoon
    Kim, Ungmo
    [J]. SECURE AND TRUST COMPUTING, DATA MANAGEMENT, AND APPLICATIONS, 2011, 186 : 209 - 216