Variable Selection for Mixed Data Clustering: Application in Human Population Genomics

Cited by: 11
Authors
Marbac, Matthieu [1]
Sedki, Mohammed [2]
Patin, Etienne [3]
Affiliations
[1] Ensai, CREST, Bruz, France
[2] Univ Paris Sud, UMR Inserm 1181, Orsay, France
[3] Inst Pasteur, CNRS, URA3012, Paris, France
Keywords
Human evolutionary genetics; Information criterion; Mixed data; Model-based clustering; Variable selection; MODEL; LIKELIHOOD; DIMENSION
DOI
10.1007/s00357-018-9301-y
Chinese Library Classification number
O1 [Mathematics]
Discipline classification code
0701; 070101
Abstract
We consider model-based clustering of human population genomic data comprising 1,318 individuals from western Central Africa and 160,470 markers. This challenging analysis leads us to develop a new methodology for variable selection in clustering. To explain the differences between subpopulations and to improve the accuracy of the estimates, variable selection is performed simultaneously with clustering. We propose two approaches for selecting variables when clustering is carried out with the latent class model (i.e., a mixture model assuming independence of the variables within components). The first method performs model selection and parameter inference simultaneously: it optimizes the Bayesian Information Criterion (BIC) with a modified version of the standard expectation-maximization (EM) algorithm. The second method performs model selection without requiring parameter inference, by maximizing the Maximum Integrated Complete-data Likelihood (MICL) criterion. Although the application involves categorical data, the proposed methods are introduced in the general context of mixed data (data composed of different types of features). As a first step, the value of both methods is demonstrated on simulated data and on several real benchmark data sets. We then apply the clustering method to the human population genomic data, which makes it possible to detect the most discriminative genetic markers. The proposed methods are implemented in the R package VarSelLCM, available on CRAN.
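The abstract does not spell out how the Bayesian Information Criterion drives variable selection, so the following is only a minimal sketch of one plausible reading, not the authors' implementation and not the VarSelLCM API. It assumes categorical variables only, membership weights that are already given (rows summing to 1), and a criterion of the form log-likelihood minus half the number of free parameters times log n. Under within-component independence such a criterion splits into one term per variable, so each variable can be scored separately as relevant (one distribution per cluster) or irrelevant (one distribution shared by all clusters). The function names `penalized_role_scores` and `select_variables` are hypothetical.

```python
import math


def penalized_role_scores(column, resp, n_levels):
    """Score one categorical variable under both roles (hypothetical helper).

    column   : category codes in 0..n_levels-1, one per individual
    resp     : resp[i][k] = membership weight of individual i in cluster k
    n_levels : number of categories of the variable
    Returns (score_relevant, score_irrelevant); higher is better.
    """
    n, g = len(column), len(resp[0])
    half_log_n = 0.5 * math.log(n)

    # Weighted category counts inside each cluster.
    counts = [[0.0] * n_levels for _ in range(g)]
    for x, weights in zip(column, resp):
        for k, w in enumerate(weights):
            counts[k][x] += w
    cluster_mass = [sum(row) for row in counts]
    total = sum(cluster_mass)
    pooled = [sum(counts[k][h] for k in range(g)) for h in range(n_levels)]

    # Maximized weighted log-likelihood with cluster-specific proportions.
    ll_rel = sum(c * math.log(c / cluster_mass[k])
                 for k in range(g) for c in counts[k] if c > 0)
    # Maximized log-likelihood with a single set of proportions.
    ll_irr = sum(t * math.log(t / total) for t in pooled if t > 0)

    # Assumed BIC-style penalty: (free parameters / 2) * log(n).
    score_rel = ll_rel - g * (n_levels - 1) * half_log_n
    score_irr = ll_irr - (n_levels - 1) * half_log_n
    return score_rel, score_irr


def select_variables(data, resp, n_levels):
    """Return one boolean per variable: True if scored as relevant."""
    roles = []
    for j, m in enumerate(n_levels):
        column = [row[j] for row in data]
        s_rel, s_irr = penalized_role_scores(column, resp, m)
        roles.append(s_rel > s_irr)
    return roles


# Toy data: 20 individuals, 2 clusters of 10, hard membership weights.
# Variable 0 separates the clusters perfectly; variable 1 has the same
# 50/50 distribution in both clusters.
data = [[0, i % 2] for i in range(10)] + [[1, i % 2] for i in range(10)]
resp = [[1.0, 0.0]] * 10 + [[0.0, 1.0]] * 10
roles = select_variables(data, resp, [2, 2])
print(roles)  # [True, False]
```

In a full procedure this decision would sit inside an iterative algorithm that also updates the membership weights; the sketch isolates only the per-variable comparison.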
Pages: 124-142
Number of pages: 19
Related papers
50 in total
  • [1] Variable Selection for Mixed Data Clustering: Application in Human Population Genomics
    Matthieu Marbac
    Mohammed Sedki
    Etienne Patin
    [J]. Journal of Classification, 2020, 37 : 124 - 142
  • [2] Clustering and variable selection in the presence of mixed variable types and missing data
    Storlie, C. B.
    Myers, S. M.
    Katusic, S. K.
    Weaver, A. L.
    Voigt, R. G.
    Croarkin, P. E.
    Stoeckel, R. E.
    Port, J. D.
    [J]. STATISTICS IN MEDICINE, 2018, 37 (19) : 2884 - 2899
  • [3] Clustering of SNP data with application to genomics
    Ng, Michael K.
    Li, Mark J.
    Ao, Sio I.
    Sham, Pak C.
    Cheung, Yiu-Ming
    Huang, Joshua Z.
    [J]. ICDM 2006: SIXTH IEEE INTERNATIONAL CONFERENCE ON DATA MINING, WORKSHOPS, 2006, : 158 - +
  • [4] A mixed integer linear model for clustering with variable selection
    Benati, Stefano
    Garcia, Sergio
    [J]. COMPUTERS & OPERATIONS RESEARCH, 2014, 43 : 280 - 285
  • [5] Clustering and variable selection for categorical multivariate data
    Bontemps, Dominique
    Toussile, Wilson
    [J]. ELECTRONIC JOURNAL OF STATISTICS, 2013, 7 : 2344 - 2371
  • [6] VARIABLE SELECTION FOR A MIXED POPULATION APPLIED IN PROTEOMICS
    Adjed, F.
    Giovannelli, J. -F.
    Giremus, A.
    Dridi, N.
    Szacherski, P.
    [J]. 2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013, : 1153 - 1157
  • [7] Application of a genetic algorithm to variable selection in fuzzy clustering
    Röver, C
    Szepannek, G
    [J]. Classification - the Ubiquitous Challenge, 2005, : 674 - 681
  • [8] Variable Selection for Meaningful Clustering of Multitopic Territorial Data
    Angerri, Xavier
    Gibert, Karina
    [J]. MATHEMATICS, 2023, 11 (13)
  • [9] Joint Bayesian Variable Selection and Graph Estimation for Non-linear SVM with Application to Genomics Data
    Sun, Wenli
    Chang, Changgee
    Long, Qi
    [J]. 2020 IEEE 7TH INTERNATIONAL CONFERENCE ON DATA SCIENCE AND ADVANCED ANALYTICS (DSAA 2020), 2020, : 315 - 323
  • [10] VARIABLE SELECTION IN CLUSTERING
    FOWLKES, EB
    GNANADESIKAN, R
    KETTENRING, JR
    [J]. JOURNAL OF CLASSIFICATION, 1988, 5 (02) : 205 - 228