A comparison of RNA-Seq data preprocessing pipelines for transcriptomic predictions across independent studies

Cited by: 1

Authors:

Van, Richard [1,3]

Alvarez, Daniel [2,3]

Mize, Travis [4]

Gannavarapu, Sravani [2,3]

Chintham Reddy, Lohitha [2,3]

Nasoz, Fatma [2,3]

Han, Mira V. [1,3]

Affiliations:

[1] Univ Nevada, Sch Life Sci, Las Vegas, NV 89154 USA

[2] Univ Nevada, Dept Comp Sci, Las Vegas, NV USA

[3] Nevada Inst Personalized Med, Las Vegas, NV 89154 USA

[4] Icahn Sch Med Mt Sinai, Inst Genom Hlth, New York, NY USA

Source:

BMC BIOINFORMATICS | 2024 / Volume 25 / Issue 01

Funding:

National Institutes of Health (NIH);

Keywords:

RNA-Seq; Classification; Cancer; Batch effect correction; Normalization; Data scaling; GENE-EXPRESSION; CANCER; TISSUE; DISCOVERY; REMOVAL;

DOI:

10.1186/s12859-024-05801-x

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Subject classification codes:

071010 ; 081704 ;

Abstract:

Background: RNA sequencing combined with machine learning techniques has provided a modern approach to the molecular classification of cancer. Class predictors, reflecting the disease class, can be constructed for known tissue types using the gene expression measurements extracted from cancer patients. One challenge of current cancer predictors is that they often have suboptimal performance estimates when integrating molecular datasets generated from different labs. Often, the data vary in quality, are procured differently, and contain unwanted noise that hampers the ability of a predictive model to extract useful information. Data preprocessing methods can be applied in attempts to reduce these systematic variations and harmonize the datasets before they are used to build a machine learning model for resolving tissue of origin.

Results: We aimed to investigate the impact of data preprocessing steps (focusing on normalization, batch effect correction, and data scaling) through trial and comparison. Our goal was to improve the cross-study predictions of tissue of origin for common cancers on large-scale RNA-Seq datasets derived from thousands of patients and over a dozen tumor types. The results showed that the choice of data preprocessing operations affected the performance of the associated classifier models constructed for tissue of origin predictions in cancer.

Conclusion: By using TCGA as a training set and applying data preprocessing methods, we demonstrated that batch effect correction improved performance, measured by weighted F1-score, in resolving tissue of origin against an independent GTEx test dataset. On the other hand, the use of data preprocessing operations worsened classification performance when the independent test dataset was aggregated from separate studies in ICGC and GEO. Therefore, based on our findings with these publicly available large-scale RNA-Seq datasets, the application of data preprocessing techniques to a machine learning pipeline is not always appropriate.
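The abstract reports performance as a weighted F1-score. As a minimal sketch of what that metric computes (not the authors' code; the function name `weighted_f1` and the toy tissue labels are hypothetical), the per-class F1 values can be averaged with weights proportional to each class's share of the true labels:

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its support in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        # True positives, false positives, and false negatives for this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Toy example with made-up tissue labels (illustration only).
y_true = ["lung", "lung", "breast", "colon"]
y_pred = ["lung", "breast", "breast", "colon"]
print(weighted_f1(y_true, y_pred))
```

Because the weights follow class frequency, well-represented tumor types dominate the score, which is worth keeping in mind when a test set has a very different class balance from the training set.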


Pages: 22

50 related items in total

[31] ARPIR: automatic RNA-Seq pipelines with interactive report
Giulio Spinozzi
Valentina Tini
Alessia Adorni
Brunangelo Falini
Maria Paola Martelli
BMC Bioinformatics, 21
[32] NDRindex: A method for the quality assessment of single-cell RNA-Seq preprocessing data
Xiao, Ruiyu
Lu, Guoshan
Jin, Shuilin
2019 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM), 2019, : 1792 - 1800
[33] ARMOR: An Automated Reproducible MOdular Workflow for Preprocessing and Differential Analysis of RNA-seq Data
Orjuela, Stephany
Huang, Ruizhu
Hembach, Katharina M.
Robinson, Mark D.
Soneson, Charlotte
G3-GENES GENOMES GENETICS, 2019, 9 (07): : 2089 - 2096
[34] pipeComp, a general framework for the evaluation of computational pipelines, reveals performant single cell RNA-seq preprocessing tools
Pierre-Luc Germain
Anthony Sonrel
Mark D. Robinson
Genome Biology, 21
[35] pipeComp, a general framework for the evaluation of computational pipelines, reveals performant single cell RNA-seq preprocessing tools
Germain, Pierre-Luc
Sonrel, Anthony
Robinson, Mark D.
GENOME BIOLOGY, 2020, 21 (01)
[36] ARPIR: automatic RNA-Seq pipelines with interactive report
Spinozzi, Giulio
Tini, Valentina
Adorni, Alessia
Falini, Brunangelo
Martelli, Maria Paola
BMC BIOINFORMATICS, 2020, 21 (Suppl 19)
[37] Defining the transcriptomic landscape of Candida glabrata by RNA-Seq
Linde, Joerg
Duggan, Seana
Weber, Michael
Horn, Fabian
Sieber, Patricia
Hellwig, Daniela
Riege, Konstantin
Marz, Manja
Martin, Ronny
Guthke, Reinhard
Kurzai, Oliver
NUCLEIC ACIDS RESEARCH, 2015, 43 (03) : 1392 - 1406
[38] Transcriptomic annotation of the Chungtien schizothoracin (Ptychobarbus chungtienensis) using Iso-seq and RNA-seq data
Gao, Zhendong
Chong, Yuqing
Lu, Ying
Ma, Shiguang
Wang, Zhen
Hong, Jieyun
Wu, Jiao
Li, Mengfei
Xi, Dongmei
Deng, Weidong
SCIENTIFIC DATA, 2024, 11 (01)
[39] Clustering of RNA-Seq samples: Comparison study on cancer data
Jaskowiak, Pablo Andretta
Costa, Ivan G.
Campello, Ricardo J. G. B.
METHODS, 2018, 132 : 42 - 49
[40] Comparison of transformations for single-cell RNA-seq data
Constantin Ahlmann-Eltze
Wolfgang Huber
Nature Methods, 2023, 20 : 665 - 672
