Clustering very large databases using EM mixture models

Authors
Bradley, PS
Fayyad, UM
Reina, CA
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Clustering very large databases is a challenge for traditional pattern recognition algorithms, e.g. the Expectation-Maximization (EM) algorithm for fitting mixture models, because of high memory and iteration requirements. Over large databases, the cost of the numerous scans required to converge and the large memory requirements of the algorithm become prohibitive. We present a decomposition of the EM algorithm that requires only a small amount of memory by limiting iterations to small data subsets. The scalable EM approach requires at most one database scan and is based on identifying regions of the data that are discardable, regions that are compressible, and regions that must be maintained in memory. Data resolution is preserved to the extent possible, based upon the size of the memory buffer and the fit of the current model to the data. Computational tests demonstrate that the scalable scheme outperforms similarly constrained EM approaches.
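The single-scan idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a one-dimensional Gaussian mixture, it folds a point into per-component sufficient statistics (count, sum, sum of squares) once its largest responsibility passes a hypothetical threshold `compress_above`, and it keeps every other point in an in-memory buffer without capping the buffer size. The abstract does not state the actual criteria for discarding, compressing, or retaining data, so the threshold rule and all function names below are assumptions made for illustration only.

```python
import math

def gauss_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def responsibilities(x, weights, means, variances):
    """Posterior probability of each component for point x."""
    dens = [w * gauss_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(dens)
    if total == 0.0:  # every density underflowed: fall back to uniform
        return [1.0 / len(dens)] * len(dens)
    return [d / total for d in dens]

def chunks(stream, size):
    """Yield successive lists of at most `size` items from an iterable."""
    chunk = []
    for x in stream:
        chunk.append(x)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

def scalable_em_1d(stream, init_means, chunk_size=200, em_iters=5,
                   compress_above=0.95):
    """Single-pass EM sketch (assumed threshold rule, see lead-in)."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    # Per-component sufficient statistics of points already folded away:
    # [count, sum of x, sum of x squared].
    comp = [[0.0, 0.0, 0.0] for _ in range(k)]
    buffer = []  # points still held at full resolution

    for chunk in chunks(stream, chunk_size):  # each point is read once
        buffer.extend(chunk)
        for _ in range(em_iters):
            # E-step: start from the compressed statistics, then add the
            # soft contributions of the buffered points.
            acc = [list(c) for c in comp]
            for x in buffer:
                r = responsibilities(x, weights, means, variances)
                for j in range(k):
                    acc[j][0] += r[j]
                    acc[j][1] += r[j] * x
                    acc[j][2] += r[j] * x * x
            # M-step: re-estimate parameters from the accumulated statistics.
            total = sum(a[0] for a in acc)
            for j, (n, s, ss) in enumerate(acc):
                weights[j] = n / total
                if n > 1e-9:
                    means[j] = s / n
                    variances[j] = max(ss / n - means[j] ** 2, 1e-6)
        # Compression: fold confidently assigned points into the statistics
        # of their most likely component and drop them from memory.
        kept = []
        for x in buffer:
            r = responsibilities(x, weights, means, variances)
            j = max(range(k), key=r.__getitem__)
            if r[j] >= compress_above:
                comp[j][0] += 1.0
                comp[j][1] += x
                comp[j][2] += x * x
            else:
                kept.append(x)
        buffer = kept

    return weights, means, variances, len(buffer)
```

Calling `scalable_em_1d(data, init_means=[1.0, 8.0])` on any iterable of floats returns the fitted weights, means, and variances together with the number of points still buffered at the end of the scan. Treating folded points as hard assignments is a simplification chosen to keep the sketch short.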
Pages: 76 - 80
Number of pages: 3