Model-based clustering and outlier detection with missing data

Cited by: 5
Authors
Tong, Hung [1]
Tortora, Cristina [1]
Affiliations
[1] San Jose State Univ, San Jose, CA 95192 USA
Keywords
Model-based clustering; Data missing at random; Contaminated normal distribution; Outliers; MAXIMUM-LIKELIHOOD; INCOMPLETE DATA; EM ALGORITHM; MIXTURES; VALUES; DISTRIBUTIONS; ECM;
DOI
10.1007/s11634-021-00476-1
Chinese Library Classification number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];
Discipline classification code
020208 ; 070103 ; 0714 ;
Abstract
The multivariate contaminated normal (MCN) distribution is recommended in model-based clustering for data characterized by mild outliers: the model can simultaneously detect outliers automatically and produce robust parameter estimates in each cluster. However, one limitation of this approach is that it requires complete data, i.e., the MCN cannot be used directly on data with missing values. In this paper, we develop a framework for fitting a mixture of MCN distributions to incomplete data sets, i.e., data sets with some values missing at random. Parameters are estimated using the expectation-conditional maximization algorithm, a variant of the expectation-maximization algorithm in which the traditional maximization steps are replaced by simpler conditional maximization steps. We perform a simulation study to compare the results of our model to those of mixtures of multivariate normal distributions and of Student's t distributions for incomplete data. The simulation also includes a study of the effect of the percentage of missing data on the performance of the three algorithms. The model is then applied to the Automobile data set (UCI machine learning repository). The results show that, while the Student's t distribution gives similar classification performance, the MCN is better at detecting outliers, with a lower false positive rate of outlier detection. The performance of all the techniques decreases linearly as the percentage of missing values increases.
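The outlier-detection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and covers a single cluster only. It assumes the usual contaminated normal parameterization, which the abstract does not spell out: a "good" component N(mu, Sigma) with weight alpha, and a "bad" component N(mu, eta * Sigma) with eta > 1. It also assumes that missing entries are coded as NaN and that, under missing at random, each point is evaluated on its observed coordinates only. The function name `mcn_good_probability` and all parameter names are hypothetical.

```python
import numpy as np


def mcn_good_probability(x, mean, cov, alpha, eta):
    """Posterior probability that x is a 'good' (non-outlying) point.

    Assumed density (not stated in the abstract):
        alpha * N(mean, cov) + (1 - alpha) * N(mean, eta * cov), eta > 1.
    NaN entries of x are treated as missing, and the point is evaluated
    on the marginal distribution of its observed coordinates.
    """
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)

    observed = ~np.isnan(x)
    d_obs = int(observed.sum())
    if d_obs == 0:
        # Nothing observed: the data carry no information, return the prior.
        return float(alpha)

    diff = x[observed] - mean[observed]
    cov_oo = cov[np.ix_(observed, observed)]
    # Squared Mahalanobis distance on the observed coordinates.
    maha = float(diff @ np.linalg.solve(cov_oo, diff))

    # Ratio of the inflated ("bad") density to the "good" density.
    # The normalizing constants cancel except for the factor eta^(-d_obs/2).
    log_ratio = 0.5 * maha * (1.0 - 1.0 / eta) - 0.5 * d_obs * np.log(eta)
    with np.errstate(over="ignore"):
        ratio = np.exp(log_ratio)
    return float(1.0 / (1.0 + (1.0 - alpha) / alpha * ratio))


if __name__ == "__main__":
    mean = [0.0, 0.0]
    cov = [[1.0, 0.0], [0.0, 1.0]]
    for point in ([0.0, 0.0], [5.0, 5.0], [np.nan, 5.0]):
        v = mcn_good_probability(point, mean, cov, alpha=0.9, eta=10.0)
        label = "outlier" if v < 0.5 else "good"
        print(point, round(v, 4), label)
```

A point is flagged as an outlier when this probability falls below 0.5. In a full mixture fit, the quantity would be computed per cluster inside the E-step, alongside the cluster memberships and the conditional expectations of the missing values.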
Pages: 5-30
Page count: 26
Related papers
50 in total
  • [1] Model-based clustering and outlier detection with missing data
    Hung Tong
    Cristina Tortora
    [J]. Advances in Data Analysis and Classification, 2022, 16 : 5 - 30
  • [2] Missing Values and Directional Outlier Detection in Model-Based Clustering
    Tong, Hung
    Tortora, Cristina
    [J]. JOURNAL OF CLASSIFICATION, 2023,
  • [3] A Model-based Approach for Text Clustering with Outlier Detection
    Yin, Jianhua
    Wang, Jianyong
    [J]. 2016 32ND IEEE INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2016: 625 - 636
  • [4] Model-based clustering with missing not at random data
    Sportisse, Aude
    Marbac, Matthieu
    Laporte, Fabien
    Celeux, Gilles
    Boyer, Claire
    Josse, Julie
    Biernacki, Christophe
    [J]. STATISTICS AND COMPUTING, 2024, 34 (04)
  • [5] Model-based Outlier Detection for Object-Relational Data
    Riahi, Fatemeh
    Schulte, Oliver
    [J]. 2015 IEEE SYMPOSIUM SERIES ON COMPUTATIONAL INTELLIGENCE (IEEE SSCI), 2015: 1590 - 1598
  • [6] Model-based clustering of multivariate skew data with circular components and missing values
    Lagona, Francesco
    Picone, Marco
    [J]. JOURNAL OF APPLIED STATISTICS, 2012, 39 (05) : 927 - 945
  • [7] Outlier Removal in Model-Based Missing Value Imputation for Medical Datasets
    Huang, Min-Wei
    Lin, Wei-Chao
    Tsai, Chih-Fong
    [J]. JOURNAL OF HEALTHCARE ENGINEERING, 2018, 2018
  • [8] A Model-Based Approach for Outlier Detection in Sensor Networks
    Ding, Min
    Liang, Qilian
    Cheng, Xiuzhen
    Al-Rodhaan, Mznah
    Al-Dhelaan, Abdullah
    Huang, Scott C.-H.
    Chen, Dechang
    [J]. AD HOC & SENSOR WIRELESS NETWORKS, 2011, 12 (3-4) : 275 - 293
  • [9] Model-Based Outlier Detection System with Statistical Preprocessing
    Singh, Asir Antony Gnana
    Leavline, Jebalamar
    [J]. JOURNAL OF MODERN APPLIED STATISTICAL METHODS, 2016, 15 (01) : 789 - 801
  • [10] A Mixture Model-Based Combination Approach for Outlier Detection
    Bouguessa, Mohamed
    [J]. INTERNATIONAL JOURNAL ON ARTIFICIAL INTELLIGENCE TOOLS, 2014, 23 (04)