Error and Error Mitigation in Low-Coverage Genome Assemblies

Cited by: 30
Authors
Hubisz, Melissa J. [1]
Lin, Michael F. [2, 3]
Kellis, Manolis [2, 3, 4]
Siepel, Adam [1, 5]
Affiliations
[1] Cornell Univ, Dept Biol Stat & Computat Biol, Ithaca, NY 14853 USA
[2] MIT, Broad Inst, Cambridge, MA 02139 USA
[3] Harvard Univ, Cambridge, MA 02138 USA
[4] MIT, Comp Sci & Artificial Intelligence Lab, Cambridge, MA 02139 USA
[5] Cornell Univ, Cornell Ctr Comparat & Populat Genom, Ithaca, NY USA
Source
PLOS ONE | 2011 | Volume 6 | Issue 02
Keywords
DNA-SEQUENCES; ACCURACY; IDENTIFICATION; ALIGNMENTS; ARACHNE; MOUSE; TREES;
DOI
10.1371/journal.pone.0017034
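Chinese Library Classification (CLC) number
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Discipline classification code
07; 0710; 09;
Abstract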
The recent release of twenty-two new genome sequences has dramatically increased the data available for mammalian comparative genomics, but twenty of these new sequences are currently limited to ~2x coverage. Here we examine the extent of sequencing error in these 2x assemblies, and its potential impact in downstream analyses. By comparing 2x assemblies with high-quality sequences from the ENCODE regions, we estimate the rate of sequencing error to be 1-4 errors per kilobase. While this error rate is fairly modest, sequencing error can still have surprising effects. For example, an apparent lineage-specific insertion in a coding region is more likely to reflect sequencing error than a true biological event, and the length distribution of coding indels is strongly distorted by error. We find that most errors are contributed by a small fraction of bases with low quality scores, in particular, by the ends of reads in regions of single-read coverage in the assembly. We explore several approaches for automatic sequencing error mitigation (SEM), making use of the localized nature of sequencing error, the fact that it is well predicted by quality scores, and information about errors that comes from comparisons across species. Our automatic methods for error mitigation cannot replace the need for additional sequencing, but they do allow substantial fractions of errors to be masked or eliminated at the cost of modest amounts of overcorrection, and they can reduce the impact of error in downstream phylogenomic analyses. Our error-mitigated alignments are available for download.
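The abstract notes that errors are well predicted by quality scores and can be masked. As a rough illustration only, and not the authors' SEM procedure, the sketch below masks bases whose Phred-style quality score falls under a threshold. The function names and the threshold of 20 are assumptions chosen for the example; the abstract does not specify them.

```python
def phred_error_probability(q: int) -> float:
    """Standard Phred relation: a score Q implies error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def mask_low_quality(seq: str, quals: list[int], min_q: int = 20) -> str:
    """Replace each base whose quality score is below min_q with 'N'.

    min_q = 20 is an illustrative threshold (assumed), not a value from the paper.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must have the same length")
    return "".join(base if q >= min_q else "N" for base, q in zip(seq, quals))


if __name__ == "__main__":
    # Hypothetical read whose final bases have low quality scores.
    print(mask_low_quality("ACGTAC", [40, 35, 30, 22, 12, 8]))
```

A simple per-base threshold like this trades some overcorrection (masking bases that were in fact correct) for fewer errors, which mirrors the trade-off the abstract describes.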
Pages: 13
Related papers
50 in total
  • [1] Genotype error due to low-coverage sequencing induces uncertainty in polygenic scoring
    Petter, Ella
    Ding, Yi
    Hou, Kangcheng
    Bhattacharya, Arjun
    Gusev, Alexander
    Zaitlen, Noah
    Pasaniuc, Bogdan
    AMERICAN JOURNAL OF HUMAN GENETICS, 2023, 110 (08) : 1319 - 1329
  • [2] Likelihood-based inference of population history from low-coverage de novo genome assemblies
    Hearn, Jack
    Stone, Graham N.
    Bunnefeld, Lynsey
    Nicholls, James A.
    Barton, Nicholas H.
    Lohse, Konrad
    MOLECULAR ECOLOGY, 2014, 23 (01) : 198 - 211
  • [3] Phylogenomics from low-coverage whole-genome sequencing
    Zhang, Feng
    Ding, Yinhuan
    Zhu, Chao-Dong
    Zhou, Xin
    Orr, Michael C.
    Scheu, Stefan
    Luan, Yun-Xia
    METHODS IN ECOLOGY AND EVOLUTION, 2019, 10 (04) : 507 - 517
  • [4] Batch effects in population genomic studies with low-coverage whole genome sequencing data: Causes, detection and mitigation
    Lou, Runyang Nicolas
    Therkildsen, Nina Overgaard
    MOLECULAR ECOLOGY RESOURCES, 2022, 22 (05) : 1678 - 1692
  • [5] A Novel Approach to Estimating Heterozygosity from Low-Coverage Genome Sequence
    Bryc, Katarzyna
    Patterson, Nick
    Reich, David
    GENETICS, 2013, 195 (02) : 553 - 561
  • [6] Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT
    Sarmashghi, Shahab
    Balaban, Metin
    Rachtman, Eleonora
    Touri, Behrouz
    Mirarab, Siavash
    Bafna, Vineet
    PLOS COMPUTATIONAL BIOLOGY, 2021, 17 (11)
  • [7] Ultra low-coverage whole genome sequencing for precision oncology in solid tumors
    Tarawneh, T. S.
    Rodepeter, F.
    Ross, P.
    Teply-Szymanski, J.
    Koch, V.
    Thoelken, C.
    Knorrenschild, J. Riera
    Wuendisch, T.
    Denkert, C.
    Neubauer, A.
    Mack, E.
    ANNALS OF ONCOLOGY, 2022, 33 (08) : S1415 - S1416
  • [8] A beginner's guide to low-coverage whole genome sequencing for population genomics
    Lou, Runyang Nicolas
    Jacobs, Arne
    Wilder, Aryn P.
    Therkildsen, Nina Overgaard
    MOLECULAR ECOLOGY, 2021, 30 (23) : 5966 - 5993
  • [9] Assembly of the Mitochondrial Genome in the Campanulaceae Family Using Illumina Low-Coverage Sequencing
    Lee, Hyun-Oh
    Choi, Ji-Weon
    Baek, Jeong-Ho
    Oh, Jae-Hyeon
    Lee, Sang-Choon
    Kim, Chang-Kug
    GENES, 2018, 9 (08)
  • [10] Extensive Error in the Number of Genes Inferred from Draft Genome Assemblies
    Denton, James F.
    Lugo-Martinez, Jose
    Tucker, Abraham E.
    Schrider, Daniel R.
    Warren, Wesley C.
    Hahn, Matthew W.
    PLOS COMPUTATIONAL BIOLOGY, 2014, 10 (12)