Separating and reintegrating latent variables to improve classification of genomic data

Authors
Payne, Nora Yujia [1]
Gagnon-Bartsch, Johann A. [1]
Affiliations
[1] Univ Michigan, Dept Stat, 1085 S Univ Ave, Ann Arbor, MI 48109 USA
Funding
National Science Foundation (USA);
Keywords
Classification; Gene expression; Linear discriminant analysis; GENE-EXPRESSION; FEATURE-SELECTION; AIR-POLLUTION; METHYLATION; REGRESSION;
DOI
10.1093/biostatistics/kxab046
Chinese Library Classification number
Q [Biological Sciences];
Discipline classification codes
07 ; 0710 ; 09 ;
Abstract
Genomic data sets contain the effects of various unobserved biological variables in addition to the variable of primary interest. These latent variables often affect a large number of features (e.g., genes), giving rise to dense latent variation. This latent variation presents both challenges and opportunities for classification. While some of these latent variables may be partially correlated with the phenotype of interest and thus helpful, others may be uncorrelated and merely contribute additional noise. Moreover, whether potentially helpful or not, these latent variables may obscure weaker effects that impact only a small number of features but more directly capture the signal of primary interest. To address these challenges, we propose the cross-residualization classifier (CRC). Through an adjustment and ensemble procedure, the CRC estimates and residualizes out the latent variation, trains a classifier on the residuals, and then reintegrates the latent variation in a final ensemble classifier. Thus, the latent variables are accounted for without discarding any potentially predictive information. We apply the method to simulated data and a variety of genomic data sets from multiple platforms. In general, we find that the CRC performs well relative to existing classifiers and sometimes offers substantial gains.
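The residualize, classify, and reintegrate idea described in the abstract can be illustrated with a simplified sketch. This is not the authors' CRC implementation: it assumes PCA as the latent-variation estimator, a nearest-centroid scorer in place of their classifier, and an equal-weight sum of standardized scores as the ensemble. It also omits the cross-residualization step, so the training-set scores on the residuals are optimistic. All function names and simulation settings are hypothetical.

```python
import numpy as np

def fit_latent(X, k):
    """Estimate k dense latent directions by PCA (assumed estimator)."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k].T                      # loadings: features x k

def split(X, mu, V):
    """Split X into latent scores Z and residuals R orthogonal to V."""
    Xc = X - mu
    Z = Xc @ V
    return Z, Xc - Z @ V.T

def fit_centroids(F, y):
    """Class means for labels 0 and 1 (nearest-centroid stand-in)."""
    return F[y == 0].mean(axis=0), F[y == 1].mean(axis=0)

def score(F, centroids):
    """Positive values favor class 1."""
    m0, m1 = centroids
    return ((F - m0) ** 2).sum(axis=1) - ((F - m1) ** 2).sum(axis=1)

def simulate(n, loadings, rng):
    """Toy data: dense latent variation plus a sparse direct signal."""
    y = np.arange(n) % 2
    Z = rng.normal(size=(n, loadings.shape[0]))
    Z[:, 0] += 2 * y - 1                     # one latent factor correlates with y
    X = Z @ loadings + rng.normal(size=(n, loadings.shape[1]))
    X[:, :10] += 1.5 * y[:, None]            # weak signal in only 10 features
    return X, y

rng = np.random.default_rng(0)
loadings = rng.normal(size=(2, 500))
X_tr, y_tr = simulate(200, loadings, rng)
X_te, y_te = simulate(200, loadings, rng)

# 1. Estimate the latent variation and residualize it out.
mu, V = fit_latent(X_tr, k=2)
Z_tr, R_tr = split(X_tr, mu, V)
Z_te, R_te = split(X_te, mu, V)

# 2. Train one scorer on the residuals and one on the latent scores.
c_res = fit_centroids(R_tr, y_tr)
c_lat = fit_centroids(Z_tr, y_tr)

# 3. Reintegrate: sum the two scores after scaling by their training spread.
def ensemble(Z, R):
    s_lat = score(Z, c_lat) / score(Z_tr, c_lat).std()
    s_res = score(R, c_res) / score(R_tr, c_res).std()
    return (s_lat + s_res > 0).astype(int)

accuracy = (ensemble(Z_te, R_te) == y_te).mean()
print(f"held-out accuracy: {accuracy:.2f}")
```

Keeping the latent scores as a second input, rather than discarding them after adjustment, is what lets the ensemble use latent variables that happen to correlate with the phenotype. In practice the ensemble weights would be tuned, for example by cross-validation, rather than fixed as they are here.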
Pages: 1133-1149
Page count: 17
Related papers
50 in total
  • [21] Weighted Tensor Decomposition for Learning Latent Variables with Partial Data
    Gottesman, Omer
    Pan, Weiwei
    Doshi-Velez, Finale
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 84, 2018, 84
  • [22] A Compact Representation of Visual Speech Data Using Latent Variables
    Zhou, Ziheng
    Hong, Xiaopeng
    Zhao, Guoying
    Pietikainen, Matti
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2014, 36 (01) : 181 - 187
  • [23] ALL classification—integration of genomic and cytogenetic data
    Alessia Errico
    Nature Reviews Clinical Oncology, 2014, 11 (8) : 440 - 440
  • [24] Reconstructing a latent representation of gene expression from genomic alterations to improve clinical utility of real-world clinicogenomics data
    Baron, Maayan
    Kumar, Sunil
    Kuperwaser, Felicia
    Tracy, Dillon
    Vucic, Emily
    Sherman, Jeff
    CANCER RESEARCH, 2024, 84 (06)
  • [25] USING LATENT TOPIC FEATURES TO IMPROVE BINARY CLASSIFICATION OF SPOKEN DOCUMENTS
    Wintrode, Jonathan
    2011 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2011, : 5544 - 5547
  • [26] Extracting Domain-Dependent Semantic Orientations of Latent Variables for Sentiment Classification
    Lee, Yeha
    Kim, Jungi
    Lee, Jong-Hyeok
    COMPUTER PROCESSING OF ORIENTAL LANGUAGES: LANGUAGE TECHNOLOGY FOR THE KNOWLEDGE-BASED ECONOMY, 2009, 5459 : 201 - 212
  • [27] Latent Variables Improve Hard-Constrained Controllable Text Generation on Weak Correlation
    Zhu, Weigang
    Liu, Xiaoming
    Yang, Guan
    Liu, Jie
    Qi, Haotian
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2024, 15 (06) : 365 - 374
  • [28] Joint regression analysis of clustered current status data with latent variables
    Feng, Yanqin
    Wu, Sijie
    Ding, Jieli
    STATISTICAL METHODS IN MEDICAL RESEARCH, 2025, 34 (02) : 224 - 242
  • [29] Latent variables, measurement error and methods for analyzing longitudinal ordinal data
    Palta, M
    Lin, CY
    AMERICAN STATISTICAL ASSOCIATION 1996 PROCEEDINGS OF THE BIOMETRICS SECTION, 1996, : 340 - 345