Average-case performance of the Apriori Algorithm

Cited by: 16
Authors
Purdom, PW [1]
Van Gucht, D
Groth, DP
Affiliations
[1] Indiana Univ, Dept Comp Sci, Bloomington, IN 47405 USA
[2] Indiana Univ, Sch Informat, Bloomington, IN 47405 USA
Keywords
data mining; algorithm analysis; Apriori Algorithm
DOI
10.1137/S0097539703422881
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Subject classification code
081202
Abstract
The failure rate of the Apriori Algorithm is studied analytically for the case of random shoppers. The time needed by the Apriori Algorithm is determined by the number of item sets that are output (successes: item sets that occur in at least k baskets) and the number of item sets that are counted but not output (failures: item sets where all subsets of the item set occur in at least k baskets but the full set occurs in fewer than k baskets). The number of successes is a property of the data; no algorithm that is required to output each success can avoid doing work associated with the successes. The number of failures is a property of both the algorithm and the data. We find that under a wide range of conditions the performance of the Apriori Algorithm is almost as bad as is permitted under sophisticated worst-case analyses. In particular, there is usually a bad level with two properties: (1) it is the level where nearly all of the work is done, and (2) nearly all item sets counted are failures. Let l be the level with the most successes, and let the number of successes on level l be approximately $\binom{m}{l}$ for some m. Then, typically, the Apriori Algorithm has total output proportional to approximately $\binom{m}{l}$ and total work proportional to approximately $\binom{m}{l+1}$. In addition, m is usually much larger than l, so the ratio of work to output is proportional to approximately m/(l+1). The analytical results for random shoppers are compared against measurements for three data sets. These data sets are more like the usual applications of the algorithm. In particular, the buying patterns of the various shoppers are highly correlated. For most thresholds, these data sets also have a bad level. Thus, under most conditions nearly all of the work done by the Apriori Algorithm consists in counting item sets that fail.
Pages: 1223-1260
Number of pages: 38
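
The success/failure accounting described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: it is a plain level-wise Apriori pass written for illustration, and the names `apriori_levels` and `random_baskets` are hypothetical. The random-shopper model here is an assumption (each shopper buys each item independently with the same probability `p`); the paper's exact model may differ.

```python
import random
from itertools import combinations


def apriori_levels(baskets, k):
    """Level-wise Apriori pass (illustrative sketch).

    Returns {level: (successes, failures)}, where a success is a counted
    item set found in at least k baskets and a failure is a counted item
    set found in fewer than k baskets.
    """
    items = sorted({item for basket in baskets for item in basket})
    candidates = [frozenset([item]) for item in items]
    stats = {}
    level = 1
    while candidates:
        # Count the baskets that contain each candidate item set.
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        frequent = {c for c, n in counts.items() if n >= k}
        stats[level] = (len(frequent), len(candidates) - len(frequent))
        # A set of size level+1 is counted next only if every one of its
        # subsets of size level was a success on the current level.
        pool = sorted(set().union(*frequent))
        level += 1
        candidates = [
            frozenset(c)
            for c in combinations(pool, level)
            if all(frozenset(s) in frequent for s in combinations(c, level - 1))
        ]
    return stats


def random_baskets(num_baskets, num_items, p, seed=0):
    """Assumed random-shopper model: each item is bought independently
    with probability p (a simplification chosen for this sketch)."""
    rng = random.Random(seed)
    return [
        frozenset(i for i in range(num_items) if rng.random() < p)
        for _ in range(num_baskets)
    ]


if __name__ == "__main__":
    baskets = random_baskets(num_baskets=200, num_items=12, p=0.3, seed=1)
    for level, (successes, failures) in apriori_levels(baskets, k=10).items():
        print(f"level {level}: successes={successes} failures={failures}")
```

Comparing the per-level tallies shows where counted item sets stop being output. The work-to-output ratio quoted in the abstract follows from the identity $\binom{m}{l+1} / \binom{m}{l} = (m-l)/(l+1)$, which is close to m/(l+1) when m is much larger than l.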