Assembly of chloroplast genomes with long- and short-read data: a comparison of approaches using Eucalyptus pauciflora as a test case

Cited by: 45

Authors:

Wang, Weiwen [1]

Schalamun, Miriam [1,2]

Morales-Suarez, Alejandro [3]

Kainer, David [1]

Schwessinger, Benjamin [1]

Lanfear, Robert [1]

Affiliations:

[1] Australian Natl Univ, Res Sch Biol, Canberra, ACT, Australia

[2] Univ Nat Resources & Life Sci, Inst Appl Genet & Cell Biol, Vienna, Austria

[3] Macquarie Univ, Dept Biol Sci, Sydney, NSW, Australia

Source:

BMC GENOMICS | 2018 / Volume 19

Funding:

Australian Research Council;

Keywords:

Chloroplast genome; Genome assembly; Polishing; Illumina; Long-reads; Nanopore; HIGH-THROUGHPUT; PLASTID GENOME; DNA INSERTIONS; PHYLOGENY; MITOCHONDRIAL; EVOLUTION; SEQUENCE; ORGANIZATION; GENERATION; VERSATILE;

DOI:

10.1186/s12864-018-5348-8

Chinese Library Classification number:

Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology];

Discipline classification code:

071005 ; 0836 ; 090102 ; 100705 ;

Abstract:

Background: Chloroplasts are organelles that conduct photosynthesis in plant and algal cells. The information contained in the chloroplast genome is widely used in agriculture and in studies of evolution and ecology. Correctly assembling chloroplast genomes can be challenging because the chloroplast genome contains a pair of long inverted repeats (10-30 kb). Typically, it is simply assumed that the gross structure of the chloroplast genome matches the most commonly observed structure of two single-copy regions separated by a pair of inverted repeats. The advent of long-read sequencing technologies should remove the need to make this assumption by providing sufficient information to completely span the inverted repeat regions. Yet long reads tend to have higher error rates than short reads, and relatively little is known about the best way to combine long and short reads to obtain the most accurate chloroplast genome assemblies. Using Eucalyptus pauciflora, the snow gum, as a test case, we evaluated the effect of multiple parameters, such as different coverage of long (Oxford Nanopore) and short (Illumina) reads, different long-read lengths, and different assembly pipelines, with a view to determining the most accurate and efficient approach to chloroplast genome assembly.

Results: Hybrid assemblies combining at least 20x coverage of both long reads and short reads generated a single contig spanning the entire chloroplast genome with few or no detectable errors. Short-read-only assemblies generated three contigs (the long single-copy, short single-copy and inverted repeat regions) of the chloroplast genome. These contigs contained few single-base errors but tended to exclude several bases at the beginning or end of each contig. Long-read-only assemblies tended to create multiple contigs with a much higher single-base error rate. The chloroplast genome of Eucalyptus pauciflora is 159,942 bp long and contains 131 genes of known function.

Conclusions: Our results suggest that very accurate assemblies of chloroplast genomes can be achieved using at least 20x coverage each of long reads and short reads, provided that the long reads include at least ~5x coverage of reads longer than the inverted repeat region. We show that further increases in coverage give little or no improvement in accuracy, and that hybrid assemblies are more accurate than long-read-only or short-read-only assemblies.


Pages: 15

