NormExpression: An R Package to Normalize Gene Expression Data Using Evaluated Methods

Cited by: 8
Authors
Wu, Zhenfeng [1, 2]
Liu, Weixiang [3]
Jin, Xiufeng [2]
Ji, Haishuo [2]
Wang, Hua [2]
Glusman, Gustavo [4]
Robinson, Max [4]
Liu, Lin [2]
Ruan, Jishou [1]
Gao, Shan [2]
Affiliations
[1] Nankai Univ, Sch Math Sci, Tianjin, Peoples R China
[2] Nankai Univ, Coll Life Sci, Tianjin, Peoples R China
[3] Shenzhen Univ, Hlth Sci Ctr, Sch Biomed Engn, Shenzhen, Peoples R China
[4] Inst Syst Biol, Seattle, WA USA
Funding
National Natural Science Foundation of China
Keywords
gene expression; normalization; evaluation; R package; scRNA-seq; DIFFERENTIAL EXPRESSION; RNA;
DOI
10.3389/fgene.2019.00400
Chinese Library Classification number
Q3 [Genetics]
Discipline classification codes
071007; 090102
Abstract
Data normalization is a crucial step in gene expression analysis because it ensures the validity of downstream analyses. Although many metrics have been designed to evaluate existing normalization methods, different metrics, or the same metric applied to different datasets, yield inconsistent results, particularly for single-cell RNA sequencing (scRNA-seq) data. In the worst cases, a method evaluated as the best by one metric is evaluated as the poorest by another, or a method evaluated as the best on one dataset is evaluated as the poorest on another. This raises an open question: what principles should guide the evaluation of normalization methods? In this study, we propose the principle that a normalization method evaluated as the best by one metric should also be evaluated as the best by another metric (the consistency of metrics), and that a method evaluated as the best using scRNA-seq data should also be evaluated as the best using bulk RNA-seq data or microarray data (the consistency of datasets). We then designed a new metric named Area Under normalized CV threshold Curve (AUCVC) and applied it together with another metric, mSCC, to evaluate 14 commonly used normalization methods on both scRNA-seq data and bulk RNA-seq data; the results satisfied the consistency of metrics and the consistency of datasets. Our findings pave the way for future studies on the normalization of gene expression data and its evaluation. The raw gene expression data, normalization methods, and evaluation metrics used in this study are included in an R package named NormExpression. NormExpression provides a framework and a fast, simple way for researchers to select the best normalization method for their gene expression data by evaluating different methods (particularly data-driven methods or their own methods) under the principles of consistency of metrics and consistency of datasets.
Pages: 8