Resolving complex tandem repeats with long reads

Cited by: 46
|
Authors
Ummat, Ajay
Bashir, Ali [1]
Affiliations
[1] Icahn Sch Med Mt Sinai, Dept Genet & Genom Sci, New York, NY 10029 USA
Keywords
STRUCTURAL VARIATION; PAIRED-END; SEQUENCE; ALIGNMENT; IDENTIFICATION; POPULATION; VARIANT; SEARCH;
DOI
10.1093/bioinformatics/btu437
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification code
071010; 081704;
Abstract
Motivation: Resolving tandemly repeated genomic sequences is a necessary step in improving our understanding of the human genome. Short tandem repeats (TRs), or microsatellites, are often used as molecular markers in genetics, and clinically, variation in microsatellites can lead to genetic disorders such as Huntington's disease. Accurately resolving repeats, and in particular TRs, remains a challenging task in genome alignment, assembly and variation calling. Though tools have been developed for detecting microsatellites in short-read sequencing data, these are limited in the size and types of events they can resolve. Single-molecule sequencing technologies may potentially resolve a broader spectrum of TRs given their increased read length, but require new approaches given their significantly higher raw error rates. Because of these inherent error profiles, single-molecule reads present a unique challenge in accurately identifying TR regions and estimating TR copy number. Results: Here we present PACMONSTR, a reference-based probabilistic approach to identify the TR region and estimate the number of TR elements in long DNA reads. We present a multistep approach that requires as input a reference region and the reference TR element. Initially, the TR region is identified in the long DNA reads via a 3-stage modified Smith-Waterman approach; then, the expected number of TR elements is calculated using a pair-hidden Markov model (pair-HMM)-based method. Finally, TR-based genotype selection (or clustering: homozygous/heterozygous) is performed with Gaussian mixture models, using the Akaike information criterion and coverage expectations.
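The final genotype-selection step described in the abstract can be illustrated with a minimal sketch. This is not PACMONSTR's code: it assumes each read has already been reduced to one estimated TR copy number, fits one- and two-component one-dimensional Gaussian mixtures by EM, and picks the model with the lower Akaike information criterion (AIC). The function names, the variance floor `MIN_VAR`, the median-split initialization, and the parameter counts (2 and 5) are all assumptions made for illustration, and the coverage expectations mentioned in the abstract are not modeled.

```python
import math

# Assumed variance floor (in squared repeat units); it keeps a component from
# collapsing onto a single read. The value is illustrative, not from the paper.
MIN_VAR = 0.25


def _log_normal(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def _mean_var(xs):
    """Sample mean and floored (maximum-likelihood) variance."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, max(v, MIN_VAR)


def _responsibilities(x, comps):
    """Posterior probability of each component for one observation."""
    logs = [math.log(w) + _log_normal(x, m, v) for w, m, v in comps]
    top = max(logs)
    ws = [math.exp(l - top) for l in logs]
    z = sum(ws)
    return [w / z for w in ws], top + math.log(z)


def _log_likelihood(xs, comps):
    return sum(_responsibilities(x, comps)[1] for x in xs)


def fit_one(xs):
    """Single Gaussian: the homozygous hypothesis."""
    comps = [(1.0,) + _mean_var(xs)]
    return _log_likelihood(xs, comps), comps


def fit_two(xs, iters=50):
    """Two-component mixture fitted by EM: the heterozygous hypothesis."""
    s = sorted(xs)
    half = len(s) // 2
    # Initialize by splitting the sorted counts at the median.
    comps = [(0.5,) + _mean_var(s[:half]), (0.5,) + _mean_var(s[half:])]
    for _ in range(iters):
        resp = [_responsibilities(x, comps)[0] for x in xs]  # E-step
        updated = []
        for k, old in enumerate(comps):  # M-step
            nk = sum(r[k] for r in resp)
            if nk < 1e-9:  # component is empty; keep its old parameters
                updated.append(old)
                continue
            m = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            v = sum(r[k] * (x - m) ** 2 for r, x in zip(resp, xs)) / nk
            updated.append((nk / len(xs), m, max(v, MIN_VAR)))
        comps = updated
    return _log_likelihood(xs, comps), comps


def call_genotype(counts):
    """Return (label, components); components are (weight, mean, variance)."""
    if not counts:
        raise ValueError("need at least one per-read copy-number estimate")
    ll1, one = fit_one(counts)
    if len(counts) < 2:
        return "homozygous", one
    ll2, two = fit_two(counts)
    aic1 = 2 * 2 - 2 * ll1  # free parameters: mean, variance
    aic2 = 2 * 5 - 2 * ll2  # free parameters: 2 means, 2 variances, 1 weight
    if aic2 < aic1:
        return "heterozygous", sorted(two, key=lambda c: c[1])
    return "homozygous", one
```

For example, `call_genotype([9.8, 10.1, 10.3, 9.9, 10.0, 19.7, 20.2, 20.1, 19.9, 20.4])` selects the two-component model, with component means near 10 and 20 copies, whereas per-read estimates that all cluster around a single value keep the one-component model. Ties in AIC fall back to the simpler, homozygous call.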
Pages: 3491-3498
Page count: 8
Related papers
50 in total
  • [1] Finding long tandem repeats in long noisy reads
    Morishita, Shinichi
    Ichikawa, Kazuki
    Myers, Eugene W.
    BIOINFORMATICS, 2021, 37 (05) : 612 - 621
  • [2] Decomposing mosaic tandem repeats accurately from long reads
    Masutani, Bansho
    Kawahara, Riki
    Morishita, Shinichi
    BIOINFORMATICS, 2023, 39 (04)
  • [3] Telomeres terminating with long complex tandem repeats
    Kamnert, I
    Lopez, CC
    Rosen, M
    Edstrom, JE
    HEREDITAS, 1997, 127 (03) : 175 - 180
  • [4] TandemTools: mapping long reads and assessing/improving assembly quality in extra-long tandem repeats
    Mikheenko, Alla
    Bzikadze, Andrey V.
    Gurevich, Alexey
    Miga, Karen H.
    Pevzner, Pavel A.
    BIOINFORMATICS, 2020, 36 : 75 - 83
  • [5] LongTR: genome-wide profiling of genetic variation at tandem repeats from long reads
    Jam, Helyaneh Ziaei
    Zook, Justin M.
    Javadzadeh, Sara
    Park, Jonghun
    Sehgal, Aarushi
    Gymrek, Melissa
    GENOME BIOLOGY, 2024, 25 (01):
  • [6] Resolving repeat families with long reads
    Philipp Bongartz
    BMC Bioinformatics, 20
  • [7] Resolving repeat families with long reads
    Bongartz, Philipp
    BMC BIOINFORMATICS, 2019, 20 (1)
  • [8] RF: A method for filtering short reads with tandem repeats for genome mapping
    Misawa, Kazuharu
    GENOMICS, 2013, 102 (01) : 35 - 37
  • [9] ReviSTER: an automated pipeline to revise misaligned reads to simple tandem repeats
    Tae, Hongseok
    McMahon, Kevin W.
    Settlage, Robert E.
    Bavarva, Jasmin H.
    Garner, Harold R.
    BIOINFORMATICS, 2013, 29 (14) : 1734 - 1741
  • [10] Probably Correct: Rescuing Repeats with Short and Long Reads
    Cechova, Monika
    GENES, 2021, 12 (01) : 1 - 13