CHARR efficiently estimates contamination from DNA sequencing data

Cited by: 3
Authors
Lu, Wenhan [1, 2, 3]
Gauthier, Laura D. [1, 4]
Poterba, Timothy [1, 2, 3]
Giacopuzzi, Edoardo [5]
Goodrich, Julia K. [1, 2]
Stevens, Christine R. [1, 2, 3]
King, Daniel [1, 2, 3]
Daly, Mark J. [1, 2, 3, 6]
Neale, Benjamin M. [1, 2, 3, 7]
Karczewski, Konrad J. [1, 2, 7]
Affiliations
[1] Broad Inst MIT & Harvard, Program Med & Populat Genet, Cambridge, MA 02142 USA
[2] Massachusetts Gen Hosp, Analyt & Translat Genet Unit, Boston, MA 02114 USA
[3] Broad Inst MIT & Harvard, Stanley Ctr Psychiat Res, Cambridge, MA 02142 USA
[4] Broad Inst MIT & Harvard, Data Sci Platform, Cambridge, MA 02142 USA
[5] Human Technopole, Viale Rita Levi Montalcini 1, I-20157 Milan, Italy
[6] Inst Mol Med Finland, Helsinki, Finland
[7] Broad Inst MIT & Harvard, Novo Nordisk Fdn Ctr Genom Mech Dis, Cambridge, MA 02142 USA
Keywords
SAMPLES
DOI
10.1016/j.ajhg.2023.10.011
Chinese Library Classification number
Q3 [Genetics]
Subject classification codes
071007; 090102
Abstract
DNA sample contamination is a major issue in clinical and research applications of whole-genome and whole-exome sequencing. Even modest levels of contamination can substantially degrade the overall quality of variant calls and lead to widespread genotyping errors. Popular tools for estimating the contamination level currently rely on short-read data (BAM/CRAM files), which are expensive to store and manipulate and are often not retained or shared widely. We propose CHARR (contamination from homozygous alternate reference reads), a metric that estimates DNA sample contamination from variant-level whole-genome and whole-exome sequence data by leveraging the infiltration of reference reads within homozygous alternate variant calls. CHARR uses a small proportion of variant-level genotype information and can therefore be computed from single-sample gVCFs or from callsets in VCF or BCF format, as well as from efficiently stored variant calls in Hail VariantDataset format. Our results demonstrate that CHARR accurately recapitulates results from existing tools at substantially reduced cost, improving the accuracy and efficiency of downstream analyses of ultra-large whole-genome and whole-exome sequencing datasets.
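The idea behind the metric can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract states only that CHARR leverages reference reads observed within homozygous alternate calls, so the sketch simply averages the reference-read fraction across such calls. The `HomAltCall` type, the `ref_read_fraction` function, and the `min_depth` threshold are hypothetical names and parameters introduced for illustration, and the published method may apply normalization or filtering that the abstract does not describe.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class HomAltCall:
    """Read counts at one homozygous alternate genotype call (hypothetical type)."""
    ref_reads: int  # reads supporting the reference allele
    alt_reads: int  # reads supporting the alternate allele


def ref_read_fraction(calls: Iterable[HomAltCall], min_depth: int = 10) -> Optional[float]:
    """Mean fraction of reference reads across homozygous alternate calls.

    In an uncontaminated sample this fraction should be close to zero, so a
    larger value suggests reads from another source. min_depth is an assumed
    filter for this sketch, not a value taken from the paper.
    """
    fractions = []
    for call in calls:
        depth = call.ref_reads + call.alt_reads
        if depth < min_depth:
            continue  # skip low-coverage calls, where the fraction is noisy
        fractions.append(call.ref_reads / depth)
    if not fractions:
        return None  # no usable calls
    return sum(fractions) / len(fractions)


# Toy data: the last call has depth 5 and is skipped by the depth filter.
calls = [HomAltCall(0, 40), HomAltCall(2, 38), HomAltCall(1, 19), HomAltCall(0, 5)]
estimate = ref_read_fraction(calls)
```

Only per-variant allele depths are needed here, which is why a quantity of this kind can be derived from VCF-style genotype fields without access to the underlying BAM/CRAM reads.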
Pages: 2068-2076
Page count: 9