CHARR efficiently estimates contamination from DNA sequencing data

Cited by: 3
Authors
Lu, Wenhan [1, 2, 3]
Gauthier, Laura D. [1, 4]
Poterba, Timothy [1, 2, 3]
Giacopuzzi, Edoardo [5]
Goodrich, Julia K. [1, 2]
Stevens, Christine R. [1, 2, 3]
King, Daniel [1, 2, 3]
Daly, Mark J. [1, 2, 3, 6]
Neale, Benjamin M. [1, 2, 3, 7]
Karczewski, Konrad J. [1, 2, 7]
Affiliations
[1] Broad Inst MIT & Harvard, Program Med & Populat Genet, Cambridge, MA 02142 USA
[2] Massachusetts Gen Hosp, Analyt & Translat Genet Unit, Boston, MA 02114 USA
[3] Broad Inst MIT & Harvard, Stanley Ctr Psychiat Res, Cambridge, MA 02142 USA
[4] Broad Inst MIT & Harvard, Data Sci Platform, Cambridge, MA 02142 USA
[5] Human Technopole, Viale Rita Levi Montalcini 1, I-20157 Milan, Italy
[6] Inst Mol Med Finland, Helsinki, Finland
[7] Broad Inst MIT & Harvard, Novo Nordisk Fdn Ctr Genom Mech Dis, Cambridge, MA 02142 USA
Keywords
SAMPLES;
DOI
10.1016/j.ajhg.2023.10.011
Chinese Library Classification number
Q3 [Genetics];
Discipline classification code
071007; 090102;
Abstract
DNA sample contamination is a major issue in clinical and research applications of whole-genome and exome sequencing. Even modest levels of contamination can substantially affect the overall quality of variant calls and lead to widespread genotyping errors. Currently, popular tools for estimating the contamination level use short-read data (BAM/CRAM files), which are expensive to store and manipulate and are often not retained or shared widely. We propose CHARR (contamination from homozygous alternate reference reads), a metric that estimates DNA sample contamination from variant-level whole-genome and exome sequence data by leveraging the infiltration of reference reads within homozygous alternate variant calls. CHARR uses a small proportion of variant-level genotype information and thus can be computed from single-sample gVCFs or callsets in VCF or BCF formats, as well as from efficiently stored variant calls in Hail VariantDataset format. Our results demonstrate that CHARR accurately recapitulates results from existing tools at substantially reduced cost, improving the accuracy and efficiency of downstream analyses of ultra-large whole-genome and exome sequencing datasets.
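The idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Call` record, the function name `hom_alt_ref_fraction`, the `min_depth` filter, and the optional division by a reference allele frequency (`ref_af`) are all assumptions made for illustration. The sketch only shows the core intuition that, in an uncontaminated sample, homozygous alternate calls should carry almost no reference reads, so the average reference-read fraction at those sites tracks contamination.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Call:
    """One biallelic genotype call (hypothetical record, not a VCF parser)."""
    genotype: Tuple[int, int]       # e.g. (1, 1) for homozygous alternate
    ref_depth: int                  # reads supporting the reference allele
    alt_depth: int                  # reads supporting the alternate allele
    ref_af: Optional[float] = None  # population reference allele frequency (assumed input)


def hom_alt_ref_fraction(calls: Iterable[Call], min_depth: int = 10) -> Optional[float]:
    """Mean fraction of reference reads across homozygous alternate calls.

    If a call carries ref_af, its fraction is divided by that frequency;
    this normalization is an assumption of the sketch, not taken from the abstract.
    Returns None when no call passes the filters.
    """
    total = 0.0
    n = 0
    for call in calls:
        if call.genotype != (1, 1):
            continue  # only homozygous alternate calls are informative here
        depth = call.ref_depth + call.alt_depth
        if depth < min_depth:
            continue  # skip low-depth sites (illustrative threshold)
        fraction = call.ref_depth / depth
        if call.ref_af is not None:
            if call.ref_af <= 0:
                continue  # cannot normalize by a zero frequency
            fraction /= call.ref_af
        total += fraction
        n += 1
    return total / n if n else None
```

For example, two homozygous alternate calls with 1 of 20 and 0 of 20 reference reads give a mean reference-read fraction of 0.025; heterozygous and low-depth calls are ignored.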
Pages: 2068-2076
Number of pages: 9