Supervised Classification of High-Dimensional Correlated Data: Application to Genomic Data

Authors
Aboubacry Gaye
Abdou Ka Diongue
Seydou Nourou Sylla
Maryam Diarra
Amadou Diallo
Cheikh Talla
Cheikh Loucoubar
Affiliations
[1] Laboratory for Studies and Research in Statistics and Development
[2] Gaston Berger University of Saint Louis
[3] Epidemiology
[4] Clinical Research and Data Science Unit
[5] Institut Pasteur de Dakar
[6] Information and Communication Technologies for Development
[7] Alioune Diop University of Bambey
Source
Journal of Classification | 2024 / Vol. 41
Keywords
Supervised dimension reduction; Correlation blocks; High-dimensional supervised classification; Genomic data
DOI
Not available
Abstract
This work addresses the problem of supervised classification for high-dimensional and highly correlated data using correlation blocks and supervised dimension reduction. We propose a method that combines block partitioning based on interval graph modeling with an extension of principal component analysis (PCA) that incorporates conditional class moment estimates in the low-dimensional projection. Block partitioning handles the high correlation in the data by grouping variables into blocks so that correlation within a block is maximized and correlation between variables in different blocks is minimized. The extended PCA then performs the low-dimensional projection and supervised clustering. Applied to gene expression data from 445 individuals divided into two groups (diseased and non-diseased) and 719,656 single nucleotide polymorphisms (SNPs), this method shows good clustering and prediction performance.
SNPs are a type of genetic variation that represents a difference in a single deoxyribonucleic acid (DNA) building block, namely a nucleotide. Previous research has shown that SNPs can be used to identify the correct population origin of an individual and can act in isolation or simultaneously to impact a phenotype. In this regard, studying the contribution of genetics to infectious disease phenotypes is crucial. The classical statistical models currently used in genome-wide association studies (GWAS) have shown their limitations in detecting genes of interest in complex diseases such as asthma or malaria. In this study, we first investigate a linkage disequilibrium (LD) block partition method based on interval graph modeling to handle the high correlation between SNPs. We then use supervised approaches, in particular the approach that extends PCA by incorporating conditional class moment estimates in the low-dimensional projection, to identify the SNPs that determine malaria episodes. Experimental results obtained on the Dielmo-Ndiop project dataset show that the linear discriminant analysis (LDA) approach achieves high accuracy in predicting malaria episodes.
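The pipeline described in the abstract (partition correlated variables into blocks, project each block to low dimension, then classify) can be illustrated with a minimal sketch. This is not the authors' method: a simple greedy threshold on the correlation between adjacent columns stands in for the interval graph block partition, plain PCA stands in for the extended PCA with conditional class moments, and the data are synthetic. The function names, the threshold of 0.5, and the simulated block structure are all assumptions made for illustration.

```python
import numpy as np

def contiguous_blocks(X, threshold=0.5):
    """Greedy stand-in for block partitioning (assumption, not interval graphs):
    start a new block when adjacent columns correlate below the threshold."""
    corr = np.corrcoef(X, rowvar=False)
    blocks, current = [], [0]
    for j in range(1, X.shape[1]):
        if abs(corr[j - 1, j]) >= threshold:
            current.append(j)
        else:
            blocks.append(current)
            current = [j]
    blocks.append(current)
    return blocks

def block_scores(X, blocks):
    """Score of the first principal component of each block (plain PCA via SVD)."""
    scores = []
    for cols in blocks:
        Z = X[:, cols] - X[:, cols].mean(axis=0)
        _, _, vt = np.linalg.svd(Z, full_matrices=False)
        scores.append(Z @ vt[0])
    return np.column_stack(scores)

def fit_lda(S, y, ridge=1e-6):
    """Two-class Fisher LDA: returns the direction w and the midpoint cut-off c."""
    S0, S1 = S[y == 0], S[y == 1]
    m0, m1 = S0.mean(axis=0), S1.mean(axis=0)
    # Pooled within-class scatter, with a small ridge for numerical stability.
    Sw = (np.atleast_2d(np.cov(S0, rowvar=False)) * (len(S0) - 1)
          + np.atleast_2d(np.cov(S1, rowvar=False)) * (len(S1) - 1))
    Sw = Sw / (len(S) - 2) + ridge * np.eye(S.shape[1])
    w = np.linalg.solve(Sw, m1 - m0)
    c = w @ (m0 + m1) / 2.0
    return w, c

# Synthetic data (hypothetical): 3 blocks of 5 correlated columns, 200 samples.
rng = np.random.default_rng(0)
n, n_blocks, block_size = 200, 3, 5
y = rng.integers(0, 2, size=n)
columns = []
for b in range(n_blocks):
    latent = rng.normal(size=n)
    if b == 0:
        latent = latent + 3.0 * y  # only the first block carries class signal
    columns.append(latent[:, None] + 0.3 * rng.normal(size=(n, block_size)))
X = np.hstack(columns)

blocks = contiguous_blocks(X)
S = block_scores(X, blocks)
w, c = fit_lda(S, y)
pred = (S @ w > c).astype(int)
acc = float((pred == y).mean())  # training accuracy on the synthetic data
print(len(blocks), "blocks; training accuracy:", acc)
```

Reducing each block to a single score keeps the classifier's input dimension equal to the number of blocks rather than the number of variables, which is the motivation for block partitioning when the variable count (here, SNPs) far exceeds the sample size.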
Pages: 158-169
Page count: 11