Clustering and variable selection in the presence of mixed variable types and missing data

Cited by: 13

Authors:

Storlie, C. B. ^{[1]}

Myers, S. M. ^{[3]}

Katusic, S. K. ^{[1]}

Weaver, A. L. ^{[1]}

Voigt, R. G. ^{[2]}

Croarkin, P. E. ^{[1]}

Stoeckel, R. E. ^{[1]}

Port, J. D. ^{[1]}

Affiliations:

[1] Mayo Clin, Rochester, MN 55905 USA

[2] Texas Childrens Hosp, Houston, TX 77030 USA

[3] Geisinger Autism & Dev Med Inst, Lewisburg, PA USA

Source:

STATISTICS IN MEDICINE | 2018 / Vol. 37 / No. 19

Keywords:

Dirichlet process; hierarchical Bayesian modeling; model-based clustering; missing data; mixed variable types; variable selection; DIRICHLET PROCESS MIXTURE; MULTINOMIAL PROBIT MODEL; BAYESIAN-ANALYSIS; PRIOR DISTRIBUTIONS; DENSITY-ESTIMATION; CATEGORICAL-DATA; SAMPLING METHODS; PARAMETERS; LIKELIHOOD; REGULARIZATION;

DOI:

10.1002/sim.7697

Chinese Library Classification (CLC) number:

Q [Biological Sciences];

Discipline classification code:

07 ; 0710 ; 09 ;

Abstract:

We consider the problem of model-based clustering in the presence of many correlated, mixed continuous, and discrete variables, some of which may have missing values. Discrete variables are treated with a latent continuous variable approach, and the Dirichlet process is used to construct a mixture model with an unknown number of components. Variable selection is also performed to identify the variables that are most influential for determining cluster membership. The work is motivated by the need to cluster patients thought to potentially have autism spectrum disorder on the basis of many cognitive and/or behavioral test scores. There are a modest number of patients (486) in the data set along with many (55) test score variables (many of which are discrete valued and/or missing). The goal of the work is to (1) cluster these patients into similar groups to help identify those with similar clinical presentation and (2) identify a sparse subset of tests that inform the clusters in order to eliminate unnecessary testing. The proposed approach compares very favorably with other methods via simulation of problems of this type. The results of the autism spectrum disorder analysis suggested 3 clusters to be most likely, while only 4 test scores had high (>0.5) posterior probability of being informative. This will result in much more efficient and informative testing. The need to cluster observations on the basis of many correlated, continuous/discrete variables with missing values is a common problem in the health sciences as well as in many other disciplines.
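The abstract names two building blocks: a Dirichlet process mixture with an unknown number of components, and discrete variables handled through latent continuous variables. The following is a minimal illustrative sketch of those two ideas only, not the authors' model or sampler. It assumes a truncated stick-breaking construction for the mixture weights and simple fixed cutpoints for one ordinal variable; the concentration parameter `alpha`, the truncation level of 20, the cutpoints, the 10% missingness rate, and all function names are hypothetical choices made for illustration. Only the sample size of 486 comes from the abstract.

```python
import numpy as np


def stick_breaking_weights(alpha, n_components, rng):
    """Draw mixture weights from a truncated stick-breaking construction."""
    betas = rng.beta(1.0, alpha, size=n_components)
    betas[-1] = 1.0  # close the stick so the truncated weights sum to 1
    # Fraction of the stick still unbroken before each component.
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    return betas * remaining


def discretize_latent(z, cutpoints):
    """Map latent continuous values to ordinal categories 0..len(cutpoints)."""
    return np.searchsorted(cutpoints, z)


rng = np.random.default_rng(0)
n_patients = 486  # sample size reported in the abstract

# Assumed values, chosen only for illustration.
weights = stick_breaking_weights(alpha=1.0, n_components=20, rng=rng)
labels = rng.choice(len(weights), size=n_patients, p=weights)

# One latent continuous score per patient, centered on its cluster mean.
cluster_means = rng.normal(0.0, 2.0, size=len(weights))
latent = rng.normal(cluster_means[labels], 1.0)

# The observed discrete score is the latent value binned by the cutpoints.
observed = discretize_latent(latent, cutpoints=np.array([-1.0, 1.0]))

# Mark a random 10% of scores as missing (NaN), as happens in real test data.
observed_with_missing = observed.astype(float)
observed_with_missing[rng.random(n_patients) < 0.1] = np.nan
```

Most of the weight typically falls on a few components, which is how a Dirichlet process mixture can leave the number of occupied clusters to be determined by the data rather than fixed in advance.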


Pages: 2884-2899

Page count: 16
