Using Supervised Learning Methods for Gene Selection in RNA-Seq Case-Control Studies

Cited by: 33
Authors
Wenric, Stephane [1, 2]
Shemirani, Ruhollah [3]
Affiliations
[1] Univ Liege, GIGA Res, Lab Human Genet, Liege, Belgium
[2] Mt Sinai Hosp, Icahn Sch Med, Charles Bronfman Inst Personalized Med, Dept Genet & Genom Sci, New York, NY 10029 USA
[3] Univ Southern Calif, Informat Sci Inst, Dept Comp Sci, Marina Del Rey, CA USA
Keywords
RNA-Seq; supervised learning; random forests; variational autoencoders; gene selection; feature selection; transcriptomics; gene expression; CANCER; EXPRESSION; EVOLUTION; RECEPTOR; GROWTH; IGF-1; TOOL
DOI
10.3389/fgene.2018.00297
Chinese Library Classification number
Q3 [Genetics]
Discipline classification codes
071007; 090102
Abstract
Whole transcriptome studies typically yield large amounts of data, with expression values for all genes or transcripts of the genome. The search for genes of interest in a particular study setting can thus be a daunting task, usually relying on automated computational methods. Moreover, most biological questions imply that such a search should be performed in a multivariate setting, to take into account the inter-gene relationships. Differential expression analysis commonly yields large lists of genes deemed significant, even after adjustment for multiple testing, making the subsequent study possibilities extensive. Here, we explore the use of supervised learning methods to rank large ensembles of genes defined by their expression values measured with RNA-Seq in a typical two-class sample set. First, we use one of the variable importance measures generated by the random forests classification algorithm as a metric to rank genes. Second, we define the EPS (extreme pseudo-samples) pipeline, making use of VAEs (Variational Autoencoders) and regressors to extract a ranking of genes while leveraging the feature space of both virtual and comparable samples. We show that, on 12 cancer RNA-Seq data sets ranging from 323 to 1,210 samples, using either a random forests-based gene selection method or the EPS pipeline outperforms differential expression analysis for 9 and 8 out of the 12 datasets respectively, in terms of identifying subsets of genes associated with survival. These results demonstrate the potential of supervised learning-based gene selection methods in RNA-Seq studies and highlight the need to use such multivariate gene selection methods alongside the widely used differential expression analysis.
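The random forests-based ranking described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which variable importance measure was used, so the impurity-based importance exposed by scikit-learn is an assumption, as are the function name `rank_genes_by_rf_importance`, the gene labels, and the synthetic data standing in for a two-class RNA-Seq expression matrix.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def rank_genes_by_rf_importance(expr, labels, gene_names, n_trees=500, seed=0):
    """Rank genes by random forest impurity-based importance, highest first.

    expr: array of shape (n_samples, n_genes) with expression values.
    labels: array of shape (n_samples,) with the two class labels.
    """
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(expr, labels)
    importances = forest.feature_importances_
    order = np.argsort(importances)[::-1]  # indices sorted by decreasing importance
    return [(gene_names[i], float(importances[i])) for i in order]


# Synthetic stand-in for a case-control expression matrix (assumption, not real data).
rng = np.random.default_rng(0)
n_samples, n_genes = 200, 50
labels = np.repeat([0, 1], n_samples // 2)      # 100 controls, 100 cases
expr = rng.normal(size=(n_samples, n_genes))
expr[labels == 1, 0] += 3.0                     # hypothetical gene "g0" is shifted in cases
gene_names = [f"g{i}" for i in range(n_genes)]

ranking = rank_genes_by_rf_importance(expr, labels, gene_names, n_trees=200)
print(ranking[:5])  # top five (gene, importance) pairs
```

Because the forest is fit on all genes at once, the ranking reflects each gene's contribution in a multivariate model rather than a per-gene test, which is the contrast with differential expression analysis that the abstract draws.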
Pages: 9