A comparison of RNA-Seq data preprocessing pipelines for transcriptomic predictions across independent studies

Cited by: 1
Authors
Van, Richard [1, 3]
Alvarez, Daniel [2, 3]
Mize, Travis [4]
Gannavarapu, Sravani [2, 3]
Chintham Reddy, Lohitha [2, 3]
Nasoz, Fatma [2, 3]
Han, Mira V. [1, 3]
Affiliations
[1] Univ Nevada, Sch Life Sci, Las Vegas, NV 89154 USA
[2] Univ Nevada, Dept Comp Sci, Las Vegas, NV USA
[3] Nevada Inst Personalized Med, Las Vegas, NV 89154 USA
[4] Icahn Sch Med Mt Sinai, Inst Genom Hlth, New York, NY USA
Funding
US National Institutes of Health (NIH)
Keywords
RNA-Seq; Classification; Cancer; Batch effect correction; Normalization; Data scaling; GENE-EXPRESSION; CANCER; TISSUE; DISCOVERY; REMOVAL;
DOI
10.1186/s12859-024-05801-x
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: RNA sequencing combined with machine learning techniques has provided a modern approach to the molecular classification of cancer. Class predictors, reflecting the disease class, can be constructed for known tissue types using gene expression measurements extracted from cancer patients. One challenge is that current cancer predictors often show suboptimal performance when integrating molecular datasets generated by different labs. The data are often of variable quality, procured differently, and contain unwanted noise that hampers a predictive model's ability to extract useful information. Data preprocessing methods can be applied to reduce these systematic variations and harmonize the datasets before they are used to build a machine learning model for resolving tissue of origin.
Results: We investigated the impact of data preprocessing steps, focusing on normalization, batch effect correction, and data scaling, through trial and comparison. Our goal was to improve cross-study predictions of tissue of origin for common cancers on large-scale RNA-Seq datasets derived from thousands of patients and over a dozen tumor types. The results showed that the choice of data preprocessing operations affected the performance of the classifier models constructed for tissue of origin predictions in cancer.
Conclusion: Using TCGA as a training set and applying data preprocessing methods, we demonstrated that batch effect correction improved performance, measured by weighted F1-score, in resolving tissue of origin against an independent GTEx test dataset. In contrast, data preprocessing operations worsened classification performance when the independent test dataset was aggregated from separate studies in ICGC and GEO. Therefore, based on our findings with these publicly available large-scale RNA-Seq datasets, applying data preprocessing techniques in a machine learning pipeline is not always appropriate.
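The abstract reports performance as a weighted F1-score. As a minimal sketch of what that metric computes (not the authors' code), the per-class F1 scores can be averaged with weights proportional to each class's share of the true labels. The function name `weighted_f1` and the tissue labels below are hypothetical and chosen only for illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative helper)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0

    for label, n_true in support.items():
        # True positives: samples of this class that were predicted as this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)

        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0

        # Weight each class by its share of the true labels.
        score += (n_true / total) * f1

    return score


# Hypothetical tissue-of-origin labels, not data from the study.
y_true = ["lung", "lung", "lung", "breast", "breast", "colon"]
y_pred = ["lung", "lung", "breast", "breast", "breast", "colon"]
print(round(weighted_f1(y_true, y_pred), 3))  # 0.833
```

Because each class is weighted by its support, frequent tumor types dominate the score; a class that appears only among the predictions contributes nothing to the average.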
Pages: 22