We consider the problem of model-based clustering in the presence of many correlated, mixed continuous and discrete variables, some of which may have missing values. Discrete variables are treated with a latent continuous variable approach, and the Dirichlet process is used to construct a mixture model with an unknown number of components. Variable selection is also performed to identify the variables that are most influential in determining cluster membership. The work is motivated by the need to cluster patients thought to potentially have autism spectrum disorder on the basis of many cognitive and/or behavioral test scores. The data set contains a modest number of patients (486) and many test score variables (55), many of which are discrete valued and/or have missing values. The goal of the work is to (1) group these patients into clusters with similar clinical presentation and (2) identify a sparse subset of tests that inform the clusters, in order to eliminate unnecessary testing. In simulations of problems of this type, the proposed approach compares very favorably with other methods. The autism spectrum disorder analysis suggested that 3 clusters are most likely, while only 4 test scores had a high (>0.5) posterior probability of being informative. This should make testing considerably more efficient and informative. The need to cluster observations on the basis of many correlated, continuous/discrete variables with missing values is a common problem in the health sciences as well as in many other disciplines.
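The two modeling ideas named above can be illustrated with a minimal sketch. It is not the authors' sampler: it only simulates data from a Chinese restaurant process (one standard representation of the Dirichlet process, in which the number of clusters is not fixed in advance) and then thresholds a latent Gaussian variable to obtain an ordinal score. The function names, the concentration parameter `alpha=1.0`, the cluster-mean prior, and the cutpoints are all hypothetical choices made for illustration.

```python
import random


def crp_assignments(n, alpha, rng):
    """Draw cluster labels for n items from a Chinese restaurant process."""
    labels = []
    counts = []  # counts[k] = number of items currently in cluster k
    for _ in range(n):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels.append(k)
    return labels


def discretize(z, cutpoints):
    """Map a latent continuous value to an ordinal category (0, 1, 2, ...)."""
    return sum(z > c for c in cutpoints)


rng = random.Random(0)

# 486 matches the number of patients in the abstract; alpha is an assumption.
labels = crp_assignments(486, alpha=1.0, rng=rng)

# Hypothetical cluster-specific means for a single latent variable.
means = {k: rng.gauss(0.0, 3.0) for k in set(labels)}
latent = [rng.gauss(means[k], 1.0) for k in labels]

# The observed discrete score is the latent value cut at assumed thresholds.
observed = [discretize(z, cutpoints=[-1.0, 0.0, 1.0]) for z in latent]
```

In this sketch the number of occupied clusters is an outcome of the draw rather than an input, which is the property that lets a Dirichlet process mixture leave the number of components unknown.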