Reference-Free Validation of Short Read Data

Cited by: 20
Authors
Schroeder, Jan [1 ,2 ]
Bailey, James [1 ,2 ]
Conway, Thomas [2 ]
Zobel, Justin [1 ,2 ]
Affiliations
[1] Univ Melbourne, Dept Comp Sci & Software Engn, Parkville, Vic 3052, Australia
[2] NICTA Victoria Res Lab, Parkville, Vic, Australia
Source
PLOS ONE | 2010 / Volume 5 / Issue 09
Funding
Australian Research Council;
Keywords
GENOME;
DOI
10.1371/journal.pone.0012681
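Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract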
Background: High-throughput DNA sequencing techniques offer the ability to rapidly and cheaply sequence material such as whole genomes. However, the short-read data produced by these techniques can be biased or compromised at several stages in the sequencing process; the sources and properties of some of these biases are not always known. Accurate assessment of bias is required for experimental quality control, genome assembly, and interpretation of coverage results. An additional challenge is that, for new genomes or material from an unidentified source, there may be no reference available against which the reads can be checked.
Results: We propose analytical methods for identifying biases in a collection of short reads, without recourse to a reference. These, in conjunction with existing approaches, comprise a methodology that can be used to quantify the quality of a set of reads. Our methods involve use of three different measures: analysis of base calls; analysis of k-mers; and analysis of distributions of k-mers. We apply our methodology to a wide range of short read data and show that, surprisingly, strong biases appear to be present. These include gross overrepresentation of some poly-base sequences, per-position biases towards some bases, and apparent preferences for some starting positions over others.
Conclusions: The existence of biases in short read data is known, but they appear to be greater and more diverse than identified in previous literature. Statistical analysis of a set of short reads can help identify issues prior to assembly or resequencing, and should help guide chemical or statistical methods for bias rectification.
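The three measures named in the abstract (base calls, k-mers, and distributions of k-mers) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy reads, and the choice of k = 3 are all assumptions made for illustration, and the abstract does not specify how the paper computes or tests any of these statistics.

```python
from collections import Counter


def per_position_base_freqs(reads):
    """Fraction of each base call observed at each read position (hypothetical helper)."""
    length = max(len(r) for r in reads)
    counts = [Counter() for _ in range(length)]
    for read in reads:
        for i, base in enumerate(read):
            counts[i][base] += 1
    # Normalise each position's counts to fractions.
    return [
        {base: n / sum(c.values()) for base, n in c.items()}
        for c in counts
    ]


def kmer_counts(reads, k):
    """Count every overlapping k-mer across all reads (hypothetical helper)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def count_spectrum(counts):
    """Distribution of k-mer counts: how many distinct k-mers occur n times."""
    return Counter(counts.values())


# Toy reads, invented for illustration only.
reads = ["ACGTAC", "ACGTTT", "AAAAAA", "ACGAAA"]
freqs = per_position_base_freqs(reads)
kmers = kmer_counts(reads, 3)
spectrum = count_spectrum(kmers)

# Poly-base k-mers such as "AAA" standing out at the top of the ranking
# is the kind of overrepresentation the abstract describes.
top_kmers = kmers.most_common(3)
```

On real data, a strongly skewed `freqs[i]` at some position or a poly-base k-mer dominating `top_kmers` would be a prompt for closer inspection, not proof of bias, since genome composition also shapes these statistics.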
Pages: 1-11
Page count: 11