Unsupervised discretization by two-dimensional MDL-based histogram

Authors
Lincen Yang
Mitra Baratchi
Matthijs van Leeuwen
Affiliation
[1] Leiden University, Leiden Institute of Advanced Computer Science
Source
Machine Learning | 2023 | Vol. 112, No. 7
Keywords
Unsupervised discretization; Histogram model; Density estimation; Exploratory data analysis
DOI
Not available
Abstract
Unsupervised discretization is a crucial step in many knowledge discovery tasks. The state-of-the-art method for one-dimensional data infers locally adaptive histograms using the minimum description length (MDL) principle, but the multi-dimensional case is far less studied: current methods consider the dimensions one at a time (if not independently), which results in discretizations based on rectangular cells of adaptive size. Unfortunately, this approach cannot adequately characterize dependencies among dimensions, and it can produce discretizations with more cells (or bins) than is desirable. To address this problem, we propose an expressive model class that allows for far more flexible partitions of two-dimensional data. We extend the state of the art for the one-dimensional case to obtain a model selection problem based on the normalized maximum likelihood, a form of refined MDL. As the flexibility of our model class comes at the cost of a vast search space, we introduce a heuristic algorithm, named PALM, which partitions each dimension alternately and then merges neighboring regions, all using the MDL principle. Experiments on synthetic data show that PALM (1) accurately reveals ground truth partitions that are within the model class (i.e., the search space), given a large enough sample size; (2) closely approximates a wide range of partitions outside the model class; and (3) converges, in contrast to the state-of-the-art multivariate discretization method IPD. Finally, we apply our algorithm to three spatial datasets and demonstrate that, compared to kernel density estimation (KDE), our algorithm not only reveals more detailed density changes but also fits unseen data better, as measured by the log-likelihood.
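The evaluation criterion named at the end of the abstract, log-likelihood on unseen data, can be illustrated with a minimal sketch. It is not PALM and not the authors' code: it fits a plain equal-width two-dimensional histogram with NumPy and scores held-out points under it. The function names, the fixed grid, and the `floor` value used to avoid log(0) in empty cells are all assumptions made for illustration.

```python
import numpy as np


def fit_histogram_density(train, bins, bounds):
    """Fit a fixed equal-width 2D histogram density (hypothetical helper).

    train  : array of shape (n, 2)
    bins   : number of bins per dimension
    bounds : ((x_lo, x_hi), (y_lo, y_hi)), the domain of the histogram
    """
    (x_lo, x_hi), (y_lo, y_hi) = bounds
    counts, x_edges, y_edges = np.histogram2d(
        train[:, 0], train[:, 1],
        bins=bins, range=[[x_lo, x_hi], [y_lo, y_hi]],
    )
    # Area of every cell; density = relative frequency / cell area,
    # so the density integrates to 1 over the domain.
    cell_area = np.outer(np.diff(x_edges), np.diff(y_edges))
    density = counts / (counts.sum() * cell_area)
    return density, x_edges, y_edges


def mean_log_likelihood(test, density, x_edges, y_edges, floor=1e-12):
    """Average log-density of held-out points (hypothetical helper).

    `floor` is an assumed safeguard against log(0) in empty cells.
    """
    # Map each point to its cell index; clip so that points on the
    # upper boundary fall into the last cell.
    ix = np.clip(np.searchsorted(x_edges, test[:, 0], side="right") - 1,
                 0, len(x_edges) - 2)
    iy = np.clip(np.searchsorted(y_edges, test[:, 1], side="right") - 1,
                 0, len(y_edges) - 2)
    return float(np.mean(np.log(np.maximum(density[ix, iy], floor))))


# Usage on synthetic data: train and test drawn from the same distribution.
rng = np.random.default_rng(0)
train = rng.beta(2.0, 5.0, size=(2000, 2))
test = rng.beta(2.0, 5.0, size=(500, 2))
density, x_edges, y_edges = fit_histogram_density(train, 8, ((0, 1), (0, 1)))
print(mean_log_likelihood(test, density, x_edges, y_edges))
```

A higher held-out value indicates a better fit to unseen data. The fixed grid here only stands in for a density estimate; the abstract's method instead selects a flexible partition using the MDL principle.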
Pages: 2397-2431
Page count: 34