Detecting and correcting systematic variation in large-scale RNA sequencing data

Cited by: 121
Authors
Li, Sheng [1 ,2 ]
Labaj, Pawel P. [3 ]
Zumbo, Paul [1 ,2 ]
Sykacek, Peter [3 ]
Shi, Wei [4 ]
Shi, Leming [5 ,6 ,7 ]
Phan, John [8 ]
Wu, Po-Yen [8 ]
Wang, May [8 ]
Wang, Charles [9 ,10 ]
Thierry-Mieg, Danielle [11 ]
Thierry-Mieg, Jean [11 ]
Kreil, David P. [3 ,12 ]
Mason, Christopher E. [1 ,2 ,13 ]
Affiliations
[1] Weill Cornell Med Coll, Dept Physiol & Biophys, New York, NY 10065 USA
[2] Weill Cornell Med Coll, HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsau, New York, NY USA
[3] Boku Univ Vienna, Bioinformat Res Grp, Vienna, Austria
[4] WEHI, Dept Bioinformat, Melbourne, Vic, Australia
[5] Fudan Univ, State Key Lab Genet Engn, Sch Life Sci, Shanghai 200433, Peoples R China
[6] Fudan Univ, MOE Key Lab Contemporary Anthropol, Sch Life Sci, Shanghai 200433, Peoples R China
[7] Fudan Univ, Sch Pharm, Shanghai 200433, Peoples R China
[8] Georgia Inst Technol, Sch Elect & Comp Engn, Atlanta, GA 30332 USA
[9] Loma Linda Univ, Ctr Genom, Loma Linda, CA 92350 USA
[10] Loma Linda Univ, Sch Med, Div Microbiol & Mol Genet, Loma Linda, CA USA
[11] Natl Ctr Biotechnol Informat, Bethesda, MD USA
[12] Univ Warwick, Coventry CV4 7AL, W Midlands, England
[13] Feil Family Brain & Mind Res Inst, New York, NY USA
Funding
US National Institutes of Health (NIH);
Keywords
QUALITY-CONTROL; GENE-EXPRESSION; DIFFERENTIAL EXPRESSION; UNWANTED VARIATION; MESSENGER-RNA; SEQ; NORMALIZATION; TRANSCRIPTS; ALGORITHMS; PACKAGE;
DOI
10.1038/nbt.3000
Chinese Library Classification number
Q81 [Bioengineering (biotechnology)]; Q93 [Microbiology];
Discipline classification code
071005; 0836; 090102; 100705;
Abstract
High-throughput RNA sequencing (RNA-seq) enables comprehensive scans of entire transcriptomes, but best practices for analyzing RNA-seq data have not been fully defined, particularly for data collected with multiple sequencing platforms or at multiple sites. Here we used standardized RNA samples with built-in controls to examine sources of error in large-scale RNA-seq studies and their impact on the detection of differentially expressed genes (DEGs). Analysis of variations in guanine-cytosine content, gene coverage, sequencing error rate and insert size allowed identification of decreased reproducibility across sites. Moreover, commonly used methods for normalization (cqn, EDASeq, RUV2, sva, PEER) varied in their ability to remove these systematic biases, depending on sample complexity and initial data quality. Normalization methods that combine data from genes across sites are strongly recommended to identify and remove site-specific effects and can substantially improve RNA-seq studies.
Pages: 888-895
Page count: 8