Using Supervised Learning Methods for Gene Selection in RNA-Seq Case-Control Studies

Cited by: 33
Authors
Wenric, Stephane [1, 2]
Shemirani, Ruhollah [3]
Affiliations
[1] Univ Liege, GIGA Res, Lab Human Genet, Liege, Belgium
[2] Mt Sinai Hosp, Icahn Sch Med, Charles Bronfman Inst Personalized Med, Dept Genet & Genom Sci, New York, NY 10029 USA
[3] Univ Southern Calif, Informat Sci Inst, Dept Comp Sci, Marina Del Rey, CA USA
Keywords
RNA-Seq; supervised learning; random forests; variational autoencoders; gene selection; feature selection; transcriptomics; gene expression; CANCER; EXPRESSION; EVOLUTION; RECEPTOR; GROWTH; IGF-1; TOOL
DOI
10.3389/fgene.2018.00297
Chinese Library Classification number
Q3 [Genetics]
Subject classification codes
071007; 090102
Abstract
Whole transcriptome studies typically yield large amounts of data, with expression values for all genes or transcripts of the genome. The search for genes of interest in a particular study setting can thus be a daunting task, usually relying on automated computational methods. Moreover, most biological questions imply that such a search should be performed in a multivariate setting, to take into account the inter-gene relationships. Differential expression analysis commonly yields large lists of genes deemed significant, even after adjustment for multiple testing, making the subsequent study possibilities extensive. Here, we explore the use of supervised learning methods to rank large ensembles of genes defined by their expression values measured with RNA-Seq in a typical two-class sample set. First, we use one of the variable importance measures generated by the random forests classification algorithm as a metric to rank genes. Second, we define the EPS (extreme pseudo-samples) pipeline, making use of VAEs (Variational Autoencoders) and regressors to extract a ranking of genes while leveraging the feature space of both virtual and comparable samples. We show that, on 12 cancer RNA-Seq data sets ranging from 323 to 1,210 samples, using either a random forests-based gene selection method or the EPS pipeline outperforms differential expression analysis for 9 and 8 out of the 12 data sets respectively, in terms of identifying subsets of genes associated with survival. These results demonstrate the potential of supervised learning-based gene selection methods in RNA-Seq studies and highlight the need to use such multivariate gene selection methods alongside the widely used differential expression analysis.
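The first approach in the abstract, ranking genes by a random forest variable importance measure, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which importance measure was used, so scikit-learn's impurity-based `feature_importances_` stands in as an assumption, and the function name `rank_genes_by_rf_importance`, the synthetic expression matrix, and the gene labels are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def rank_genes_by_rf_importance(expr, labels, gene_names, n_trees=500, seed=0):
    """Return (gene, importance) pairs sorted from most to least important.

    Assumption: impurity-based importance from scikit-learn is used here as a
    stand-in; the abstract does not specify the measure.
    """
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(expr, labels)
    importances = forest.feature_importances_
    order = np.argsort(importances)[::-1]  # indices by descending importance
    return [(gene_names[i], float(importances[i])) for i in order]


# Synthetic two-class data (hypothetical): 200 samples x 20 genes.
rng = np.random.default_rng(0)
n_samples, n_genes = 200, 20
labels = rng.integers(0, 2, size=n_samples)      # 0 = control, 1 = case
expr = rng.normal(size=(n_samples, n_genes))     # stand-in expression values
expr[:, 3] += 3.0 * labels                       # gene_3 is shifted in cases
gene_names = [f"gene_{i}" for i in range(n_genes)]

ranking = rank_genes_by_rf_importance(expr, labels, gene_names, n_trees=200)
print(ranking[:5])  # top five genes with their importance scores
```

Because the forest is fitted on all genes jointly, the resulting ranking reflects a multivariate view of the data, in contrast to the gene-by-gene tests of differential expression analysis.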
Pages: 9