Impact of short-read sequencing on the misassembly of a plant genome

Cited by: 4
Authors
Wang, Peipei [1, 2]
Meng, Fanrui [1, 2]
Moore, Bethany M. [1, 3]
Shiu, Shin-Han [1, 2, 3, 4]
Affiliations
[1] Michigan State Univ, Dept Plant Biol, E Lansing, MI 48824 USA
[2] Michigan State Univ, DOE Great Lake Bioenergy Res Ctr, E Lansing, MI 48824 USA
[3] Michigan State Univ, Ecol Evolut & Behav Biol Program, E Lansing, MI 48824 USA
[4] Michigan State Univ, Dept Computat Math Sci & Engn, E Lansing, MI 48824 USA
Funding
National Science Foundation (USA);
Keywords
Genome misassembly; Read coverage; Machine learning; Solanum lycopersicum; QUALITY ASSESSMENT; DNA; EVOLUTION; SIGNATURES; TOOL;
DOI
10.1186/s12864-021-07397-5
Chinese Library Classification number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology];
Discipline classification code
071005; 0836; 090102; 100705;
Abstract
Background: The availability of plant genome sequences has led to significant advances. However, with few exceptions, the great majority of existing genome assemblies are derived from short-read sequencing technologies and show highly uneven read coverage, which is indicative of sequencing and assembly issues that could significantly affect downstream analyses of plant genomes. In tomato, for example, 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of the short-read-based assembly had significantly higher and lower coverage than background, respectively.
Results: To understand the causes of such uneven coverage, we first established machine learning models capable of predicting genomic regions with variable coverage and found that high-coverage regions tend to have higher simple sequence repeat and tandem gene densities than background regions. To determine whether the high-coverage regions were misassembled, we examined a recently available tomato long-read-based assembly and found that 27.8% (1.41 Mb) of high-coverage regions were potentially misassemblies of duplicate sequences, compared to 1.4% of background regions. In addition, using a predictive model that can distinguish correctly assembled from incorrectly assembled high-coverage regions, we found that misassembled high-coverage regions tend to be flanked by simple sequence repeats, pseudogenes, and transposable elements.
Conclusions: Our study provides insights into the causes of variable-coverage regions and a quantitative assessment of the factors contributing to plant genome misassembly when using short reads. The generality of these causes and factors should be tested further in other species.
Pages: 18
Related papers
50 in total
  • [1] Impact of short-read sequencing on the misassembly of a plant genome
    Peipei Wang
    Fanrui Meng
    Bethany M. Moore
    Shin-Han Shiu
    BMC Genomics, 22
  • [2] Blindspots in short-read genome sequencing for classic chromosomal rearrangements
    Gauthier, Lucas
    Caillot, Claire
    Pujalte, Mathilde
    Till, Marianne
    Sanlaville, Damien
    Chatron, Nicolas
    EUROPEAN JOURNAL OF HUMAN GENETICS, 2024, 32 : 1572 - 1572
  • [3] Assessing reproducibility of inherited variants detected with short-read whole genome sequencing
    Bohu Pan
    Luyao Ren
    Vitor Onuchic
    Meijian Guan
    Rebecca Kusko
    Steve Bruinsma
    Len Trigg
    Andreas Scherer
    Baitang Ning
    Chaoyang Zhang
    Christine Glidewell-Kenney
    Chunlin Xiao
    Eric Donaldson
    Fritz J. Sedlazeck
    Gary Schroth
    Gokhan Yavas
    Haiying Grunenwald
    Haodong Chen
    Heather Meinholz
    Joe Meehan
    Jing Wang
    Jingcheng Yang
    Jonathan Foox
    Jun Shang
    Kelci Miclaus
    Lianhua Dong
    Leming Shi
    Marghoob Mohiyuddin
    Mehdi Pirooznia
    Ping Gong
    Rooz Golshani
    Russ Wolfinger
    Samir Lababidi
    Sayed Mohammad Ebrahim Sahraeian
    Steve Sherry
    Tao Han
    Tao Chen
    Tieliu Shi
    Wanwan Hou
    Weigong Ge
    Wen Zou
    Wenjing Guo
    Wenjun Bao
    Wenzhong Xiao
    Xiaohui Fan
    Yoichi Gondo
    Ying Yu
    Yongmei Zhao
    Zhenqiang Su
    Zhichao Liu
    Genome Biology, 23
  • [4] Reconstruction of Acetogenesis Pathway Using Short-Read Sequencing of Clostridium aceticum Genome
    Lee, Sooin
    Song, Yoseb
    Choe, Donghui
    Cho, Suhyung
    Yu, Seok Jong
    Cho, Yongseong
    Kim, Sun Chang
    Cho, Byung-Kwan
    JOURNAL OF NANOSCIENCE AND NANOTECHNOLOGY, 2015, 15 (05) : 3852 - 3861
  • [5] The impact of reference panel short-read sequencing inaccessibility on genotype imputation
    Mitchell, J. S.
    Konig, E.
    Gogele, M.
    Pattaro, C.
    Pramstaller, P. P.
    Fuchsberger, C.
    EUROPEAN JOURNAL OF HUMAN GENETICS, 2019, 27 : 514 - 514
  • [6] Assessing reproducibility of inherited variants detected with short-read whole genome sequencing
    Pan, Bohu
    Ren, Luyao
    Onuchic, Vitor
    Guan, Meijian
    Kusko, Rebecca
    Bruinsma, Steve
    Trigg, Len
    Scherer, Andreas
    Ning, Baitang
    Zhang, Chaoyang
    Glidewell-Kenney, Christine
    Xiao, Chunlin
    Donaldson, Eric
    Sedlazeck, Fritz J.
    Schroth, Gary
    Yavas, Gokhan
    Grunenwald, Haiying
    Chen, Haodong
    Meinholz, Heather
    Meehan, Joe
    Wang, Jing
    Yang, Jingcheng
    Foox, Jonathan
    Shang, Jun
    Miclaus, Kelci
    Dong, Lianhua
    Shi, Leming
    Mohiyuddin, Marghoob
    Pirooznia, Mehdi
    Gong, Ping
    Golshani, Rooz
    Wolfinger, Russ
    Lababidi, Samir
    Sahraeian, Sayed Mohammad Ebrahim
    Sherry, Steve
    Han, Tao
    Chen, Tao
    Shi, Tieliu
    Hou, Wanwan
    Ge, Weigong
    Zou, Wen
    Guo, Wenjing
    Bao, Wenjun
    Xiao, Wenzhong
    Fan, Xiaohui
    Gondo, Yoichi
    Yu, Ying
    Zhao, Yongmei
    Su, Zhenqiang
    Liu, Zhichao
    GENOME BIOLOGY, 2022, 23 (01)
  • [7] Short-Read Sequencing Technologies for Transcriptional Analyses
    Simon, Stacey A.
    Zhai, Jixian
    Nandety, Raja Sekhar
    McCormick, Kevin P.
    Zeng, Jia
    Mejia, Diego
    Meyers, Blake C.
    ANNUAL REVIEW OF PLANT BIOLOGY, 2009, 60 : 305 - 333
  • [8] Probability model for boundaries of short-read sequencing
    Schatz, Florian
    Wienbrandt, Lars
    Schimmler, Manfred
    2012 INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING AND COMMUNICATIONS (ICACC), 2012, : 223 - 228
  • [9] Deciphering complex genome rearrangements in C. elegans using short-read whole genome sequencing
    Maroilley, Tatiana
    Li, Xiao
    Oldach, Matthew
    Jean, Francesca
    Stasiuk, Susan J.
    Tarailo-Graovac, Maja
    SCIENTIFIC REPORTS, 2021, 11 (01)
  • [10] Molecular diagnostics of myotonic dystrophies from short-read whole genome sequencing data
    Lojova, Ingrid
    Kucharik, Marcel
    Pos, Zuzana
    Zatkova, Andrea
    Budis, Jaroslav
    Kadasi, Ludevit
    Szemes, Tomas
    Radvansky, Jan
    EUROPEAN JOURNAL OF HUMAN GENETICS, 2023, 31 : 585 - 586