Separating and reintegrating latent variables to improve classification of genomic data

Citations: 0
Authors
Payne, Nora Yujia [1]
Gagnon-Bartsch, Johann A. [1]
Affiliation
[1] Univ Michigan, Dept Stat, 1085 S Univ Ave, Ann Arbor, MI 48109 USA
Funding
National Science Foundation (USA);
Keywords
Classification; Gene expression; Linear discriminant analysis; GENE-EXPRESSION; FEATURE-SELECTION; AIR-POLLUTION; METHYLATION; REGRESSION;
DOI
10.1093/biostatistics/kxab046
Chinese Library Classification number
Q [Biological Sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract
Genomic data sets contain the effects of various unobserved biological variables in addition to the variable of primary interest. These latent variables often affect a large number of features (e.g., genes), giving rise to dense latent variation. This latent variation presents both challenges and opportunities for classification. While some of these latent variables may be partially correlated with the phenotype of interest and thus helpful, others may be uncorrelated and merely contribute additional noise. Moreover, whether potentially helpful or not, these latent variables may obscure weaker effects that impact only a small number of features but more directly capture the signal of primary interest. To address these challenges, we propose the cross-residualization classifier (CRC). Through an adjustment and ensemble procedure, the CRC estimates and residualizes out the latent variation, trains a classifier on the residuals, and then reintegrates the latent variation in a final ensemble classifier. Thus, the latent variables are accounted for without discarding any potentially predictive information. We apply the method to simulated data and a variety of genomic data sets from multiple platforms. In general, we find that the CRC performs well relative to existing classifiers and sometimes offers substantial gains.
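The adjust-then-ensemble idea in the abstract can be illustrated with a small sketch. This is not the authors' CRC procedure, whose details are not given in this record. It is a simplified stand-in that assumes the latent variation can be approximated by the top k principal components, uses a hypothetical nearest-centroid scorer as the base classifier, and combines the two scores by a plain standardized sum. All function and class names below are illustrative.

```python
import numpy as np

def fit_latent_basis(X, k):
    """Estimate a k-dimensional latent subspace (assumption: top-k PCs)."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k].T  # feature means (p,), orthonormal basis (p, k)

def split_latent_residual(X, mu, V):
    """Split centered data into latent scores and residuals."""
    Xc = X - mu
    Z = Xc @ V          # latent scores, shape (n, k)
    R = Xc - Z @ V.T    # residuals with the latent variation removed
    return Z, R

class CentroidScorer:
    """Hypothetical base classifier: a positive score favours class 1."""
    def fit(self, X, y):
        self.c0 = X[y == 0].mean(axis=0)
        self.c1 = X[y == 1].mean(axis=0)
        return self

    def score(self, X):
        d0 = ((X - self.c0) ** 2).sum(axis=1)
        d1 = ((X - self.c1) ** 2).sum(axis=1)
        return d0 - d1

def fit_predict_ensemble(X_train, y_train, X_test, k=2):
    """Train on residuals and on latent scores, then combine both."""
    mu, V = fit_latent_basis(X_train, k)
    Z_tr, R_tr = split_latent_residual(X_train, mu, V)
    Z_te, R_te = split_latent_residual(X_test, mu, V)
    latent_clf = CentroidScorer().fit(Z_tr, y_train)
    resid_clf = CentroidScorer().fit(R_tr, y_train)
    # Rescale each score by its training spread so neither part dominates.
    s_lat = latent_clf.score(Z_te) / (latent_clf.score(Z_tr).std() + 1e-12)
    s_res = resid_clf.score(R_te) / (resid_clf.score(R_tr).std() + 1e-12)
    return ((s_lat + s_res) > 0).astype(int)
```

Calling `fit_predict_ensemble(X_train, y_train, X_test, k=2)` with a samples-by-features matrix and 0/1 labels returns 0/1 predictions for the test rows. The equal-weight sum is only one possible way to reintegrate the latent part; a stacked or cross-validated weighting would be a natural alternative.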
Pages: 1133-1149
Number of pages: 17