Variable Selection for Mixed Data Clustering: Application in Human Population Genomics

Cited by: 0

Authors:

Matthieu Marbac

Mohammed Sedki

Etienne Patin

Affiliations:

[1] Ensai, CREST

[2] University of Paris-Sud, UMR Inserm

[3] Institut Pasteur, 1181

Source:

Journal of Classification | 2020 / Vol. 37

Keywords:

Human evolutionary genetics; Information criterion; Mixed data; Model-based clustering; Variable selection;

DOI:

Not available

Chinese Library Classification number:

Subject classification number:

Abstract:

We consider model-based clustering of human population genomic data comprising 1,318 individuals from western Central Africa and 160,470 markers. This challenging analysis leads us to develop a new methodology for variable selection in clustering. To explain the differences between subpopulations and to increase the accuracy of the estimates, variable selection is performed simultaneously with clustering. We propose two approaches for selecting variables when clustering is handled by the latent class model (i.e., a mixture model assuming independence within components). The first method simultaneously performs model selection and parameter inference: it optimizes the Bayesian Information Criterion with a modified version of the standard expectation–maximization algorithm. The second method performs model selection without requiring parameter inference by maximizing the Maximum Integrated Complete-data Likelihood criterion. Although the application considers categorical data, the proposed methods are introduced in the general context of mixed data (data composed of different types of features). As a first step, the value of both proposed methods is demonstrated on simulated data and on several real benchmark data sets. We then apply the clustering method to the human population genomic data, which makes it possible to detect the most discriminative genetic markers. The proposed method is implemented in the R package VarSelLCM, which is available on CRAN.

Pages: 124-142

Page count: 18
