CHARR efficiently estimates contamination from DNA sequencing data

Times cited: 3
Authors
Lu, Wenhan [1, 2, 3]
Gauthier, Laura D. [1, 4]
Poterba, Timothy [1, 2, 3]
Giacopuzzi, Edoardo [5]
Goodrich, Julia K. [1, 2]
Stevens, Christine R. [1, 2, 3]
King, Daniel [1, 2, 3]
Daly, Mark J. [1, 2, 3, 6]
Neale, Benjamin M. [1, 2, 3, 7]
Karczewski, Konrad J. [1, 2, 7]
Affiliations
[1] Broad Inst MIT & Harvard, Program Med & Populat Genet, Cambridge, MA 02142 USA
[2] Massachusetts Gen Hosp, Analyt & Translat Genet Unit, Boston, MA 02114 USA
[3] Broad Inst MIT & Harvard, Stanley Ctr Psychiat Res, Cambridge, MA 02142 USA
[4] Broad Inst MIT & Harvard, Data Sci Platform, Cambridge, MA 02142 USA
[5] Human Technopole, Viale Rita Levi Montalcini 1, I-20157 Milan, Italy
[6] Inst Mol Med Finland, Helsinki, Finland
[7] Broad Inst MIT & Harvard, Novo Nordisk Fdn Ctr Genom Mech Dis, Cambridge, MA 02142 USA
Keywords
SAMPLES
DOI
10.1016/j.ajhg.2023.10.011
Chinese Library Classification
Q3 [Genetics]
Discipline classification codes
071007; 090102
Abstract
DNA sample contamination is a major issue in clinical and research applications of whole-genome and whole-exome sequencing. Even modest levels of contamination can substantially affect the overall quality of variant calls and lead to widespread genotyping errors. Currently, popular tools for estimating the contamination level use short-read data (BAM/CRAM files), which are expensive to store and manipulate and are often not retained or shared widely. We propose CHARR (contamination from homozygous alternate reference reads), a metric that estimates DNA sample contamination from variant-level whole-genome and whole-exome sequence data by leveraging the infiltration of reference reads within homozygous alternate variant calls. CHARR uses a small proportion of variant-level genotype information and thus can be computed from single-sample gVCFs or callsets in VCF or BCF formats, as well as from efficiently stored variant calls in Hail VariantDataset format. Our results demonstrate that CHARR accurately recapitulates results from existing tools at substantially reduced cost, improving the accuracy and efficiency of downstream analyses of ultra-large whole-genome and whole-exome sequencing datasets.
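The core idea in the abstract, that reference reads appearing at homozygous alternate calls signal contamination, can be illustrated with a minimal sketch. This is not the published CHARR estimator: the abstract does not give the exact formula, normalization, or site filters, so the `Call` record, the `min_depth` threshold, and the unweighted mean below are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Call:
    """Hypothetical per-variant record: genotype plus allelic read depths."""
    genotype: str   # VCF-style genotype string, e.g. "1/1"
    ref_reads: int  # reads supporting the reference allele
    alt_reads: int  # reads supporting the alternate allele


def ref_read_fraction_at_hom_alt(
    calls: Iterable[Call], min_depth: int = 10
) -> Optional[float]:
    """Mean fraction of reference reads over homozygous alternate calls.

    Simplified illustration only: the depth filter and plain averaging
    are assumptions, not the estimator defined in the paper.
    """
    fractions = []
    for call in calls:
        depth = call.ref_reads + call.alt_reads
        # A clean homozygous alternate call should carry (almost) no
        # reference reads, so any that appear hint at contamination.
        if call.genotype in ("1/1", "1|1") and depth >= min_depth:
            fractions.append(call.ref_reads / depth)
    if not fractions:
        return None  # no usable homozygous alternate sites
    return sum(fractions) / len(fractions)


# Made-up example data, not taken from the paper.
example_calls = [
    Call("1/1", 1, 39),
    Call("1/1", 0, 30),
    Call("0/1", 15, 15),  # heterozygous: ignored
    Call("1/1", 2, 38),
    Call("1/1", 1, 4),    # depth below min_depth: ignored
]
estimate = ref_read_fraction_at_hom_alt(example_calls)
```

Because only genotypes and allelic depths are needed, a calculation of this kind can run on variant-level files without access to the underlying reads, which is the cost advantage the abstract describes.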
Pages: 2068-2076
Number of pages: 9