Clustering and variable selection in the presence of mixed variable types and missing data

Cited by: 13
Authors
Storlie, C. B. [1]
Myers, S. M. [3]
Katusic, S. K. [1]
Weaver, A. L. [1]
Voigt, R. G. [2]
Croarkin, P. E. [1]
Stoeckel, R. E. [1]
Port, J. D. [1]
Affiliations
[1] Mayo Clin, Rochester, MN 55905 USA
[2] Texas Childrens Hosp, Houston, TX 77030 USA
[3] Geisinger Autism & Dev Med Inst, Lewisburg, PA USA
Keywords
Dirichlet process; hierarchical Bayesian modeling; model-based clustering; missing data; mixed variable types; variable selection; DIRICHLET PROCESS MIXTURE; MULTINOMIAL PROBIT MODEL; BAYESIAN-ANALYSIS; PRIOR DISTRIBUTIONS; DENSITY-ESTIMATION; CATEGORICAL-DATA; SAMPLING METHODS; PARAMETERS; LIKELIHOOD; REGULARIZATION;
DOI
10.1002/sim.7697
Chinese Library Classification number
Q [Biological sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract
We consider the problem of model-based clustering in the presence of many correlated, mixed continuous, and discrete variables, some of which may have missing values. Discrete variables are treated with a latent continuous variable approach, and the Dirichlet process is used to construct a mixture model with an unknown number of components. Variable selection is also performed to identify the variables that are most influential for determining cluster membership. The work is motivated by the need to cluster patients thought to potentially have autism spectrum disorder on the basis of many cognitive and/or behavioral test scores. There are a modest number of patients (486) in the data set along with many (55) test score variables (many of which are discrete valued and/or missing). The goal of the work is to (1) cluster these patients into similar groups to help identify those with similar clinical presentation and (2) identify a sparse subset of tests that inform the clusters in order to eliminate unnecessary testing. The proposed approach compares very favorably with other methods via simulation of problems of this type. The results of the autism spectrum disorder analysis suggested 3 clusters to be most likely, while only 4 test scores had high (>0.5) posterior probability of being informative. This will result in much more efficient and informative testing. The need to cluster observations on the basis of many correlated, continuous/discrete variables with missing values is a common problem in the health sciences as well as in many other disciplines.
Pages: 2884 - 2899
Number of pages: 16
Related papers
50 in total
  • [21] High-dimensional variable selection in regression and classification with missing data
    Gao, Qi
    Lee, Thomas C. M.
    SIGNAL PROCESSING, 2017, 131 : 1 - 7
  • [22] PENALIZED PAIRWISE PSEUDO LIKELIHOOD FOR VARIABLE SELECTION WITH NONIGNORABLE MISSING DATA
    Zhao, Jiwei
    Yang, Yang
    Ning, Yang
    STATISTICA SINICA, 2018, 28 (04) : 2125 - 2148
  • [23] A case study of normalization, missing data and variable selection methods in lipidomics
    Kujala, M.
    Nevalainen, J.
    STATISTICS IN MEDICINE, 2015, 34 (01) : 59 - 73
  • [24] Investigating Variable Selection Techniques Under Missing Data: A Simulation Study
    Bain, Catherine
    Shi, Dingjing
    QUANTITATIVE PSYCHOLOGY, IMPS 2023, 2024, 452 : 109 - 119
  • [25] Variable Selection Under Missing Values and Unlabeled Data in Semiconductor Processes
    Kim, Kyung-Jun
    Kim, Kyu-Jin
    Jun, Chi-Hyuck
    Chong, Il-Gyo
    Song, Geun-Young
    IEEE TRANSACTIONS ON SEMICONDUCTOR MANUFACTURING, 2019, 32 (01) : 121 - 128
  • [26] Variable selection for additive models with missing data via multiple imputation
    Yuta Shimazu
    Takayuki Yamaguchi
    Ibuki A. J. Hoshina
    Hidetoshi Matsui
    Behaviormetrika, 2025, 52 (1) : 163 - 178
  • [27] Variable selection in multivariate calibration based on clustering of variable concept
    Farrokhnia, Maryam
    Karimi, Sadegh
    ANALYTICA CHIMICA ACTA, 2016, 902 : 70 - 81
  • [28] SelvarClustMV: Variable selection approach in model-based clustering allowing for missing values
    Maugis-Rabusseau, Cathy
    Martin-Magniette, Marie-Laure
    Pelletier, Sandra
    JOURNAL OF THE SFDS, 2012, 153 (02) : 21 - 36
  • [29] Bayesian clustering of mixed-type data with relevant variable identification
    Burhanuddin, Nurul Afiqah
    Ibrahim, Kamarulzaman
    Adam, Mohd Bakri
    Mustapha, Norwati
    Zulkafli, Hani Syahida
    COMMUNICATIONS IN STATISTICS-SIMULATION AND COMPUTATION, 2024,
  • [30] Hierarchical clustering of mixed variable panel data based on new distance
    Akay, Ozlem
    Yuksel, Guzin
    COMMUNICATIONS IN STATISTICS-SIMULATION AND COMPUTATION, 2021, 50 (06) : 1695 - 1710