A reference-free approach to analyse RADseq data using standard next generation sequencing toolkits

Cited by: 14
Authors
Heller, Rasmus [1]
Nursyifa, Casia [1]
Garcia-Erill, Genis [1]
Salmona, Jordi [2]
Chikhi, Lounes [2, 3]
Meisner, Jonas [1]
Korneliussen, Thorfinn Sand [4]
Albrechtsen, Anders [1]
Affiliations
[1] Univ Copenhagen, Dept Biol, Sect Computat & RNA Biol, DK-2200 Copenhagen N, Denmark
[2] Univ Paul Sabatier, CNRS, ENFA, UMR 5174 EDB,Lab Evolut & Div Biol, Toulouse, France
[3] Inst Gulbenkian Ciencias, Oeiras, Portugal
[4] Univ Copenhagen, GLOBE Inst, Sect GeoGenet, Copenhagen K, Denmark
Keywords
allelic dropout; genetic diversity; genotype calling; genotype likelihood; RADseq; site frequency spectrum; READ ALIGNMENT; DISCOVERY; ASSOCIATION; DIVERSITY; FRAMEWORK; GENOTYPE; MAPS; SET
DOI
10.1111/1755-0998.13324
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Subject classification code
071010; 081704
Abstract
Genotyping-by-sequencing methods such as RADseq are popular for generating genomic and population-scale data sets from a diverse range of organisms. These often lack a usable reference genome, restricting users to RADseq-specific software for processing. However, these come with limitations compared to generic next generation sequencing (NGS) toolkits. Here, we describe and test a simple pipeline for reference-free RADseq data processing that blends de novo elements from STACKS with the full suite of state-of-the-art NGS tools. Specifically, we use the de novo RADseq assembly employed by STACKS to create a catalogue of RAD loci that serves as a reference for read mapping, variant calling and site filters. Using RADseq data from 28 zebra sequenced to ~8x depth-of-coverage, we evaluate our approach by comparing the site frequency spectra (SFS) to those from alternative pipelines. Most pipelines yielded similar SFS at 8x depth, but only a genotype-likelihood-based pipeline performed similarly at low sequencing depth (2-4x). We compared the RADseq SFS with medium-depth (~13x) shotgun sequencing of eight overlapping samples, revealing that the RADseq SFS was persistently slightly skewed towards rare and invariant alleles. Using simulations and human data we confirm that this is expected when there is allelic dropout (AD) in the RADseq data. AD in the RADseq data caused a heterozygosity deficit of ~16%, which dropped to ~5% after filtering AD. Hence, AD was the most important source of bias in our RADseq data.
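The link between allelic dropout, the SFS and heterozygosity can be illustrated with a toy calculation. The Python sketch below is not the authors' pipeline: the genotype matrix, the function names and the dropout model (an affected heterozygote is observed as homozygous for the reference allele) are all assumptions made for illustration only.

```python
def site_frequency_spectrum(genotypes):
    """Unfolded SFS from called diploid genotypes.

    genotypes: list of sites; each site is a list of per-individual
    alternative-allele counts (0, 1 or 2).
    Returns a list where entry k is the number of sites carrying
    k copies of the alternative allele.
    """
    n_chromosomes = 2 * len(genotypes[0])
    sfs = [0] * (n_chromosomes + 1)
    for site in genotypes:
        sfs[sum(site)] += 1
    return sfs


def heterozygosity(genotypes):
    """Fraction of all genotype calls that are heterozygous."""
    calls = [g for site in genotypes for g in site]
    return sum(1 for g in calls if g == 1) / len(calls)


def apply_allelic_dropout(genotypes, dropped):
    """Hypothetical dropout model (an assumption, not from the paper):
    a heterozygote at a (site, individual) pair listed in `dropped`
    loses its alternative allele and is observed as genotype 0."""
    observed = [list(site) for site in genotypes]
    for site_idx, ind_idx in dropped:
        if observed[site_idx][ind_idx] == 1:
            observed[site_idx][ind_idx] = 0
    return observed


# Toy data: 4 sites x 3 diploid individuals (made up for illustration).
true_genotypes = [
    [0, 1, 0],
    [1, 1, 2],
    [0, 0, 0],
    [2, 1, 1],
]
# Hypothetical (site, individual) pairs affected by dropout.
dropped = {(0, 1), (3, 2)}

observed_genotypes = apply_allelic_dropout(true_genotypes, dropped)
sfs_true = site_frequency_spectrum(true_genotypes)
sfs_observed = site_frequency_spectrum(observed_genotypes)

# Relative heterozygosity deficit caused by dropout in this toy example.
deficit = 1 - heterozygosity(observed_genotypes) / heterozygosity(true_genotypes)
```

In this toy example the dropout moves one site into the invariant class of the SFS and lowers the observed heterozygosity, which is the direction of bias the abstract describes. The size of the effect here is arbitrary and should not be compared with the reported values.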
Pages: 1085-1097
Number of pages: 13