Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data

Cited by: 45
Authors
Lause, Jan [1]
Berens, Philipp [1, 2, 3, 4]
Kobak, Dmitry [1]
Affiliations
[1] Univ Tubingen, Inst Ophthalm Res, Tubingen, Germany
[2] Univ Tubingen, Inst Bioinformat & Med Informat, Tubingen, Germany
[3] Univ Tubingen, Bernstein Ctr Computat Neurosci, Tubingen, Germany
[4] Univ Tubingen, Ctr Integrat Neurosci, Tubingen, Germany
Funding
National Institutes of Health (USA);
Keywords
MOTOR-VEHICLE CRASHES; POISSON-GAMMA MODELS; SAMPLE-MEAN VALUES; SIZE;
DOI
10.1186/s13059-021-02451-7
Chinese Library Classification number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology];
Discipline classification code
071005; 0836; 090102; 100705;
Abstract
Background: Standard preprocessing of single-cell RNA-seq UMI data includes normalization by sequencing depth to remove this technical variability, and nonlinear transformation to stabilize the variance across genes with different expression levels. Instead, two recent papers propose to use statistical count models for these tasks: Hafemeister and Satija (Genome Biol 20:296, 2019) recommend using Pearson residuals from negative binomial regression, while Townes et al. (Genome Biol 20:295, 2019) recommend fitting a generalized PCA model. Here, we investigate the connection between these approaches theoretically and empirically, and compare their effects on downstream processing.
Results: We show that the model of Hafemeister and Satija produces noisy parameter estimates because it is overspecified, which is why the original paper employs post hoc smoothing. When specified more parsimoniously, it has a simple analytic solution equivalent to the rank-one Poisson GLM-PCA of Townes et al. Further, our analysis indicates that per-gene overdispersion estimates in Hafemeister and Satija are biased, and that the data are in fact consistent with the overdispersion parameter being independent of gene expression. We then use negative control data without biological variability to estimate the technical overdispersion of UMI counts, and find that across several different experimental protocols, the data are close to Poisson and suggest very moderate overdispersion. Finally, we perform a benchmark to compare the performance of Pearson residuals, variance-stabilizing transformations, and GLM-PCA on scRNA-seq datasets with known ground truth.
Conclusions: We demonstrate that analytic Pearson residuals strongly outperform other methods for identifying biologically variable genes, and capture more of the biologically meaningful variation when used for dimensionality reduction.
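The abstract refers to a "simple analytic solution" without stating it. The sketch below is one illustration of how such residuals are commonly computed, not the authors' reference implementation. It assumes a cells-by-genes UMI count matrix, expected counts given by the product of per-cell and per-gene totals divided by the grand total, a negative binomial variance with a single fixed overdispersion parameter, and clipping of residuals to plus or minus the square root of the number of cells. The function name, the default `theta=100`, and the clipping rule are assumptions made for this example and do not appear in the record above.

```python
import numpy as np


def analytic_pearson_residuals(counts, theta=100.0, clip=None):
    """Illustrative sketch: Pearson residuals under a fixed-overdispersion NB model.

    counts : array of shape (n_cells, n_genes) holding UMI counts.
    theta  : assumed overdispersion shared by all genes (hypothetical default).
    clip   : residuals are clipped to [-clip, clip]; defaults to sqrt(n_cells).
    """
    counts = np.asarray(counts, dtype=float)

    # Expected count for each cell/gene pair from the row and column totals.
    cell_depth = counts.sum(axis=1, keepdims=True)  # shape (n_cells, 1)
    gene_total = counts.sum(axis=0, keepdims=True)  # shape (1, n_genes)
    grand_total = counts.sum()

    with np.errstate(divide="ignore", invalid="ignore"):
        mu = cell_depth * gene_total / grand_total
        # Negative binomial variance: mu + mu^2 / theta.
        residuals = (counts - mu) / np.sqrt(mu + mu**2 / theta)

    # Entries with mu == 0 (all-zero genes or cells) give 0/0; set them to 0.
    residuals = np.nan_to_num(residuals, nan=0.0, posinf=0.0, neginf=0.0)

    if clip is None:
        clip = np.sqrt(counts.shape[0])
    return np.clip(residuals, -clip, clip)
```

A call such as `analytic_pearson_residuals(X)` on a cells-by-genes matrix `X` returns an array of the same shape. Genes could then be ranked by the variance of their residuals across cells, in line with the abstract's use of residuals to identify biologically variable genes.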
Pages: 20