Variable Selection for Mixed Data Clustering: Application in Human Population Genomics

Authors
Matthieu Marbac
Mohammed Sedki
Etienne Patin
Affiliations
[1] Ensai, CREST
[2] University of Paris-Sud, UMR Inserm
[3] Institut Pasteur, 1181
Source
Journal of Classification | 2020, Vol. 37
Keywords
Human evolutionary genetics; Information criterion; Mixed data; Model-based clustering; Variable selection;
DOI
Not available
Abstract
We consider model-based clustering of human population genomic data comprising 1,318 individuals from western Central Africa and 160,470 markers. This challenging analysis leads us to develop a new methodology for variable selection in clustering. To explain the differences between subpopulations and to increase the accuracy of the estimates, variable selection is performed simultaneously with clustering. We propose two approaches for selecting variables when clustering is performed with the latent class model (i.e., a mixture model assuming independence within components). The first method simultaneously performs model selection and parameter inference: it optimizes the Bayesian Information Criterion (BIC) with a modified version of the standard expectation-maximization (EM) algorithm. The second method performs model selection without requiring parameter inference by maximizing the Maximum Integrated Complete-data Likelihood (MICL) criterion. Although the application involves categorical data, the proposed methods are introduced in the general context of mixed data (data composed of different types of features). As a first step, the value of both methods is demonstrated on simulated data and on several real benchmark data sets. We then apply the clustering method to the human population genomic data, which makes it possible to detect the most discriminative genetic markers. The proposed method is implemented in the R package VarSelLCM, which is available on CRAN.
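To make the idea of BIC-driven variable selection concrete, here is a minimal Python sketch. It is not the authors' algorithm and does not use the VarSelLCM package. It assumes that the class labels are already known, so that under within-component independence the penalized log-likelihood can be compared one variable at a time. In the method summarized above the partition is unknown and the comparison is embedded in a modified EM algorithm. The function name `bic_relevance` and the toy data are hypothetical.

```python
import numpy as np

def bic_relevance(x, z, n_classes):
    """Return True when BIC prefers class-specific parameters for variable x.

    x: 1-D array of categorical codes, z: 1-D array of class labels
    (assumed known here, which is a simplification).
    """
    n = len(x)
    levels = np.unique(x)
    m = len(levels)

    def loglik(values):
        # Maximized multinomial log-likelihood of one sample.
        counts = np.array([(values == lv).sum() for lv in levels], dtype=float)
        counts = counts[counts > 0]
        return float((counts * np.log(counts / counts.sum())).sum())

    # Irrelevant variable: one distribution shared by all classes,
    # with m - 1 free parameters.
    bic_shared = loglik(x) - 0.5 * (m - 1) * np.log(n)
    # Relevant variable: one distribution per class,
    # with n_classes * (m - 1) free parameters.
    ll_by_class = sum(loglik(x[z == k]) for k in range(n_classes))
    bic_by_class = ll_by_class - 0.5 * n_classes * (m - 1) * np.log(n)
    return bool(bic_by_class > bic_shared)

# Deterministic toy data: 400 individuals in two classes of 200.
z = np.repeat([0, 1], 200)
informative = z.copy()
informative[::10] = 1 - informative[::10]   # 10% of values disagree with the class
noise = np.tile([0, 1], 200)                # same distribution in both classes

selected = [bic_relevance(v, z, 2) for v in (informative, noise)]
print(selected)
```

In this toy setting the first variable differs strongly between the two classes, while the second has identical frequencies in both, so only the first should be retained. The penalty term is what prevents a variable with a negligible between-class difference from being declared discriminative.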
Pages: 124-142
Page count: 18