A comparison of RNA-Seq data preprocessing pipelines for transcriptomic predictions across independent studies

Cited by: 1
Authors
Van, Richard [1 ,3 ]
Alvarez, Daniel [2 ,3 ]
Mize, Travis [4 ]
Gannavarapu, Sravani [2 ,3 ]
Chintham Reddy, Lohitha [2 ,3 ]
Nasoz, Fatma [2 ,3 ]
Han, Mira V. [1 ,3 ]
Affiliations
[1] Univ Nevada, Sch Life Sci, Las Vegas, NV 89154 USA
[2] Univ Nevada, Dept Comp Sci, Las Vegas, NV USA
[3] Nevada Inst Personalized Med, Las Vegas, NV 89154 USA
[4] Icahn Sch Med Mt Sinai, Inst Genom Hlth, New York, NY USA
Funding
National Institutes of Health (USA);
Keywords
RNA-Seq; Classification; Cancer; Batch effect correction; Normalization; Data scaling; GENE-EXPRESSION; CANCER; TISSUE; DISCOVERY; REMOVAL;
DOI
10.1186/s12859-024-05801-x
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: RNA sequencing combined with machine learning techniques has provided a modern approach to the molecular classification of cancer. Class predictors, reflecting the disease class, can be constructed for known tissue types using the gene expression measurements extracted from cancer patients. One challenge of current cancer predictors is that they often have suboptimal performance estimates when integrating molecular datasets generated from different labs. Often, the data are of variable quality, procured differently, and contain unwanted noise that hampers the ability of a predictive model to extract useful information. Data preprocessing methods can be applied in an attempt to reduce these systematic variations and harmonize the datasets before they are used to build a machine learning model for resolving tissue of origin.

Results: We aimed to investigate the impact of data preprocessing steps (normalization, batch effect correction, and data scaling) through trial and comparison. Our goal was to improve the cross-study predictions of tissue of origin for common cancers on large-scale RNA-Seq datasets derived from thousands of patients and over a dozen tumor types. The results showed that the choice of data preprocessing operations affected the performance of the associated classifier models constructed for tissue of origin predictions in cancer.

Conclusion: By using TCGA as a training set and applying data preprocessing methods, we demonstrated that batch effect correction improved performance, measured by weighted F1-score, in resolving tissue of origin against an independent GTEx test dataset. On the other hand, the use of data preprocessing operations worsened classification performance when the independent test dataset was aggregated from separate studies in ICGC and GEO. Therefore, based on our findings with these publicly available large-scale RNA-Seq datasets, the application of data preprocessing techniques to a machine learning pipeline is not always appropriate.
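
The abstract reports performance as a weighted F1-score. As a minimal sketch of what that metric computes (the per-class F1 averaged with weights proportional to each class's number of true samples), the following uses only the Python standard library. The function name `weighted_f1` and the toy tissue labels are assumptions made for illustration; they are not taken from the paper, and the authors' actual evaluation code may differ.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")

    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0
        f1 = 2 * tp / denom if denom else 0.0
        score += (n_true / total) * f1
    return score


# Hypothetical tissue-of-origin labels, for illustration only
y_true = ["lung", "lung", "lung", "breast", "breast", "colon"]
y_pred = ["lung", "lung", "breast", "breast", "breast", "lung"]
print(round(weighted_f1(y_true, y_pred), 4))
```

Because each class is weighted by its support, tissue types with many samples dominate the score, which matters when the class balance of the test study differs from that of the training study.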
Pages: 22