Clustering very large databases using EM mixture models

Authors
Bradley, PS
Fayyad, UM
Reina, CA
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Clustering very large databases is a challenge for traditional pattern recognition algorithms, e.g. the Expectation-Maximization (EM) algorithm for fitting mixture models, because of their high memory and iteration requirements. Over large databases, the cost of the numerous scans required to converge and the large memory requirements of the algorithm become prohibitive. We present a decomposition of the EM algorithm that requires only a small amount of memory by limiting iterations to small data subsets. The scalable EM approach requires at most one database scan and is based on identifying regions of the data that are discardable, regions that are compressible, and regions that must be maintained in memory. Data resolution is preserved to the extent possible, based upon the size of the memory buffer and the fit of the current model to the data. Computational tests demonstrate that the scalable scheme outperforms similarly constrained EM approaches.
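The single-scan idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it is a simplified 1-D Gaussian mixture in which points the current model assigns with high confidence are folded into per-component sufficient statistics (count, sum, sum of squares) and dropped, while ambiguous points stay in memory. The function names, the confidence threshold `conf`, the quantile-based initialization, and the discard rule are all assumptions made for illustration. The sketch omits the "compressible" sub-cluster step and does not enforce a hard memory bound.

```python
import math

def responsibilities(x, weights, means, variances):
    """Posterior probability of each 1-D Gaussian component for point x."""
    logs = [math.log(w) - 0.5 * math.log(2 * math.pi * v) - 0.5 * (x - m) ** 2 / v
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)  # subtract the max for numerical stability
    probs = [math.exp(l - top) for l in logs]
    total = sum(probs)
    return [p / total for p in probs]

def one_pass_em(chunks, k, n_iter=10, conf=0.95):
    """Fit a 1-D Gaussian mixture in one pass over `chunks` (non-empty lists of floats)."""
    # Summaries of points already folded away, per component:
    # count, sum, and sum of squares.
    s0, s1, s2 = [0.0] * k, [0.0] * k, [0.0] * k
    retained = []  # points kept at full resolution
    weights = means = variances = None
    for chunk in chunks:
        buf = retained + list(chunk)
        if means is None:
            # Assumed initialization: evenly spaced quantiles of the first buffer.
            srt = sorted(buf)
            means = [srt[int((j + 0.5) * len(srt) / k)] for j in range(k)]
            mean_all = sum(buf) / len(buf)
            var_all = sum((x - mean_all) ** 2 for x in buf) / len(buf) + 1e-6
            variances = [var_all] * k
            weights = [1.0 / k] * k
        for _ in range(n_iter):
            # E-step over the buffer only; the summaries enter the M-step
            # as fixed statistics.
            n0, n1, n2 = list(s0), list(s1), list(s2)
            for x in buf:
                for j, r in enumerate(responsibilities(x, weights, means, variances)):
                    n0[j] += r
                    n1[j] += r * x
                    n2[j] += r * x * x
            total = sum(n0)
            weights = [max(n / total, 1e-12) for n in n0]
            means = [a / max(n, 1e-12) for a, n in zip(n1, n0)]
            variances = [max(b / max(n, 1e-12) - m * m, 1e-6)
                         for b, n, m in zip(n2, n0, means)]
        # Fold confidently assigned points into the summaries; keep the rest.
        retained = []
        for x in buf:
            r = responsibilities(x, weights, means, variances)
            j = max(range(k), key=r.__getitem__)
            if r[j] >= conf:
                s0[j] += 1.0
                s1[j] += x
                s2[j] += x * x
            else:
                retained.append(x)
    return weights, means, variances, retained
```

Each chunk is read once, so the whole data set is scanned a single time, and the EM iterations touch only the in-memory buffer plus `3 * k` summary numbers. A fuller treatment would also need a rule for shrinking `retained` when it outgrows the buffer.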
Pages: 76 - 80
Page count: 3