Background: High-throughput DNA sequencing techniques offer the ability to rapidly and cheaply sequence material such as whole genomes. However, the short-read data produced by these techniques can be biased or compromised at several stages in the sequencing process; the sources and properties of some of these biases are not always known. Accurate assessment of bias is required for experimental quality control, genome assembly, and interpretation of coverage results. An additional challenge is that, for new genomes or material from an unidentified source, there may be no reference available against which the reads can be checked.

Results: We propose analytical methods for identifying biases in a collection of short reads, without recourse to a reference. These, in conjunction with existing approaches, comprise a methodology that can be used to quantify the quality of a set of reads. Our methods involve the use of three different measures: analysis of base calls; analysis of k-mers; and analysis of distributions of k-mers. We apply our methodology to a wide range of short-read data and show that, surprisingly, strong biases appear to be present. These include gross overrepresentation of some poly-base sequences, per-position biases towards some bases, and apparent preferences for some starting positions over others.

Conclusions: The existence of biases in short-read data is known, but they appear to be greater and more diverse than identified in previous literature. Statistical analysis of a set of short reads can help identify issues prior to assembly or resequencing, and should help guide chemical or statistical methods for bias rectification.
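The three reference-free measures named in the Results (base calls, k-mers, and distributions of k-mers) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `per_position_frequencies`, `kmer_counts`, and `count_spectrum` are hypothetical, the reads are assumed to be plain strings over A, C, G, T (with N for uncalled bases), and no statistical test for bias is applied.

```python
from collections import Counter


def per_position_frequencies(reads, alphabet="ACGT"):
    """Return, for each read position, the fraction of each base called there.

    Reads may differ in length; bases outside `alphabet` (e.g. N) are ignored.
    """
    counts = []  # counts[i] is a Counter of bases seen at position i
    for read in reads:
        for i, base in enumerate(read.upper()):
            if i == len(counts):
                counts.append(Counter())
            counts[i][base] += 1

    freqs = []
    for c in counts:
        total = sum(c[b] for b in alphabet)
        freqs.append({b: (c[b] / total if total else 0.0) for b in alphabet})
    return freqs


def kmer_counts(reads, k):
    """Count every k-mer in the reads, skipping k-mers that contain an N."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def count_spectrum(counts):
    """Map an occurrence count to the number of distinct k-mers with that count."""
    return Counter(counts.values())
```

In this sketch, a position whose base frequencies deviate strongly from the read-wide average would hint at a per-position bias, and a poly-base k-mer such as `AAAA` sitting far in the tail of `count_spectrum(kmer_counts(reads, 4))` would hint at overrepresentation. Deciding what counts as "strongly" is left open here and would need a proper statistical criterion.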