Model-based clustering and outlier detection with missing data

Authors
Hung Tong
Cristina Tortora
Affiliation
[1] San José State University
Keywords
Model-based clustering; Data missing at random; Contaminated normal distribution; Outliers; 62H30
Abstract
The multivariate contaminated normal (MCN) distribution is recommended in model-based clustering for data characterized by mild outliers: the model can simultaneously detect outliers automatically and produce robust parameter estimates in each cluster. However, one limitation of this approach is that it requires complete data, i.e., the MCN cannot be used directly on data with missing values. In this paper, we develop a framework for fitting a mixture of MCN distributions to incomplete data sets, i.e., data sets with some values missing at random. Parameters are estimated using the expectation-conditional maximization algorithm, a variant of the expectation-maximization algorithm in which the traditional maximization step is replaced by simpler conditional maximization steps. We perform a simulation study to compare our model with mixtures of multivariate normal and Student's t distributions for incomplete data. The simulation also examines the effect of the percentage of missing data on the performance of the three algorithms. The model is then applied to the Automobile data set (UCI Machine Learning Repository). The results show that, while the Student's t distribution gives similar classification performance, the MCN detects outliers better, with a lower false positive rate. The performance of all the techniques decreases linearly as the percentage of missing values increases.
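To illustrate the idea behind outlier detection with incomplete observations, here is a minimal sketch and not the authors' implementation. It assumes the standard MCN density, alpha * N(mu, Sigma) + (1 - alpha) * N(mu, eta * Sigma) with eta > 1, and uses the fact that marginalizing over the missing coordinates leaves an MCN on the observed coordinates. It covers a single component with known parameters, not the full mixture or the ECM fitting procedure. The function names and parameter values are hypothetical.

```python
import numpy as np


def gaussian_pdf(x, mean, cov):
    """Density of a multivariate normal N(mean, cov) evaluated at x."""
    d = x.shape[0]
    diff = x - mean
    maha = diff @ np.linalg.solve(cov, diff)  # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(cov)
    return np.exp(-0.5 * (d * np.log(2.0 * np.pi) + logdet + maha))


def mcn_good_probability(x, mean, cov, alpha, eta):
    """Posterior probability that x is a 'good' (non-outlying) point.

    Assumes one MCN component with known parameters. Missing entries of x
    are encoded as NaN and are marginalized out by keeping only the
    observed sub-vector of the mean and sub-matrix of the covariance.
    """
    obs = ~np.isnan(x)
    if not obs.any():
        return alpha  # nothing observed: fall back to the prior proportion
    x_o = x[obs]
    mean_o = mean[obs]
    cov_oo = cov[np.ix_(obs, obs)]
    good = alpha * gaussian_pdf(x_o, mean_o, cov_oo)
    bad = (1.0 - alpha) * gaussian_pdf(x_o, mean_o, eta * cov_oo)
    return good / (good + bad)


# Illustrative (made-up) parameters: 3 variables, 90% good points,
# covariance of the contaminating part inflated by a factor of 10.
mean = np.zeros(3)
cov = np.eye(3)
alpha, eta = 0.9, 10.0

typical = np.array([0.1, np.nan, -0.2])  # second variable missing
extreme = np.array([6.0, np.nan, 0.0])   # far from the mean on variable 1

p_typical = mcn_good_probability(typical, mean, cov, alpha, eta)
p_extreme = mcn_good_probability(extreme, mean, cov, alpha, eta)
print(p_typical > 0.5, p_extreme > 0.5)
```

A point would be flagged as an outlier when this probability falls below 0.5. For points extremely far from the mean both densities can underflow to zero, so a more careful version would work with log-densities.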
Pages: 5 - 30
Page count: 25