A comparison of RNA-Seq data preprocessing pipelines for transcriptomic predictions across independent studies

Cited by: 1

Authors:

Van, Richard [1,3]

Alvarez, Daniel [2,3]

Mize, Travis [4]

Gannavarapu, Sravani [2,3]

Chintham Reddy, Lohitha [2,3]

Nasoz, Fatma [2,3]

Han, Mira V. [1,3]

Affiliations:

[1] Univ Nevada, Sch Life Sci, Las Vegas, NV 89154 USA

[2] Univ Nevada, Dept Comp Sci, Las Vegas, NV USA

[3] Nevada Inst Personalized Med, Las Vegas, NV 89154 USA

[4] Icahn Sch Med Mt Sinai, Inst Genom Hlth, New York, NY USA

Source:

BMC BIOINFORMATICS | 2024 / Vol. 25 / Issue 01

Funding:

National Institutes of Health (USA);

Keywords:

RNA-Seq; Classification; Cancer; Batch effect correction; Normalization; Data scaling; GENE-EXPRESSION; CANCER; TISSUE; DISCOVERY; REMOVAL;

DOI:

10.1186/s12859-024-05801-x

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Subject classification codes:

071010 ; 081704 ;

Abstract:

Background: RNA sequencing combined with machine learning techniques has provided a modern approach to the molecular classification of cancer. Class predictors, reflecting the disease class, can be constructed for known tissue types using the gene expression measurements extracted from cancer patients. One challenge of current cancer predictors is that they often have suboptimal performance estimates when integrating molecular datasets generated from different labs. Often, the data are of variable quality, procured differently, and contain unwanted noise that hampers the ability of a predictive model to extract useful information. Data preprocessing methods can be applied in an attempt to reduce these systematic variations and harmonize the datasets before they are used to build a machine learning model for resolving tissue of origin.

Results: We aimed to investigate the impact of data preprocessing steps, focusing on normalization, batch effect correction, and data scaling, through trial and comparison. Our goal was to improve the cross-study predictions of tissue of origin for common cancers on large-scale RNA-Seq datasets derived from thousands of patients and over a dozen tumor types. The results showed that the choice of data preprocessing operations affected the performance of the associated classifier models constructed for tissue of origin predictions in cancer.

Conclusion: By using TCGA as a training set and applying data preprocessing methods, we demonstrated that batch effect correction improved performance, measured by weighted F1-score, in resolving tissue of origin against an independent GTEx test dataset. On the other hand, the use of data preprocessing operations worsened classification performance when the independent test dataset was aggregated from separate studies in ICGC and GEO. Therefore, based on our findings with these publicly available large-scale RNA-Seq datasets, the application of data preprocessing techniques to a machine learning pipeline is not always appropriate.
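The abstract reports performance as a weighted F1-score, that is, the per-class F1 averaged with weights proportional to each class's number of true samples. The following is a minimal sketch of that metric in plain Python; the function name `weighted_f1` and the toy tissue labels are illustrative assumptions, not the authors' code or data.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("y_true must not be empty")

    support = Counter(y_true)      # true samples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    total = 0.0
    for label, n in support.items():
        tp = correct[label]
        if tp == 0:
            f1 = 0.0               # no true positives: F1 is defined as 0 here
        else:
            precision = tp / predicted[label]
            recall = tp / n
            f1 = 2 * precision * recall / (precision + recall)
        total += n * f1            # weight each class by its support
    return total / len(y_true)


# Toy labels, invented for illustration only (not from TCGA, GTEx, ICGC, or GEO).
y_true = ["lung", "lung", "breast", "breast", "colon"]
y_pred = ["lung", "breast", "breast", "breast", "colon"]
score = weighted_f1(y_true, y_pred)
print(round(score, 4))
```

Because each class is weighted by its support, abundant tissue types dominate the score, which is worth keeping in mind when the test set's class balance differs from the training set's.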


Pages: 22
