Variable Selection for Mixed Data Clustering: Application in Human Population Genomics

Authors
Matthieu Marbac
Mohammed Sedki
Etienne Patin
Affiliations
[1] Ensai, CREST
[2] University of Paris-Sud, UMR Inserm
[3] Institut Pasteur, 1181
Source
Journal of Classification | 2020 / Volume 37
Keywords
Human evolutionary genetics; Information criterion; Mixed data; Model-based clustering; Variable selection
DOI
Not available
Abstract
We consider model-based clustering of human population genomic data comprising 1,318 individuals from western Central Africa and 160,470 markers. This challenging analysis leads us to develop a new methodology for variable selection in clustering. To explain the differences between subpopulations and to increase the accuracy of the estimates, variable selection is performed simultaneously with clustering. We propose two approaches for selecting variables when clustering is performed with the latent class model (i.e., a mixture model assuming independence within components). The first method simultaneously performs model selection and parameter inference: it optimizes the Bayesian Information Criterion with a modified version of the standard expectation–maximization algorithm. The second method performs model selection without requiring parameter inference by maximizing the Maximum Integrated Complete-data Likelihood criterion. Although the application considers categorical data, the proposed methods are introduced in the general context of mixed data (data composed of different types of features). As a first step, the value of both proposed methods is demonstrated on simulated data and on several real benchmark data sets. We then apply the clustering method to the human population genomic data, which makes it possible to detect the most discriminative genetic markers. The proposed method is implemented in the R package VarSelLCM, available on CRAN.
Pages: 124-142
Page count: 18