Finding long tandem repeats in long noisy reads

Cited by: 3
Authors
Morishita, Shinichi [1]
Ichikawa, Kazuki [1]
Myers, Eugene W. [2, 3]
Affiliations
[1] Univ Tokyo, Grad Sch Frontier Sci, Dept Computat Biol & Med Sci, Chiba 2778562, Japan
[2] Max Planck Inst Mol Cell Biol & Genet, D-01307 Dresden, Saxony, Germany
[3] Ctr Syst Biol Dresden, D-01307 Dresden, Saxony, Germany
Keywords
FRAGILE-X; MYOTONIC-DYSTROPHY; HEXANUCLEOTIDE REPEAT; TRINUCLEOTIDE REPEAT; CTG REPEAT; EXPANSION; REGION; IDENTIFICATION; C9ORF72; FINDER;
DOI
10.1093/bioinformatics/btaa865
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Motivation: Long tandem repeat expansions of more than 1000 nt have been suggested to be associated with diseases, but remain largely unexplored in individual human genomes because read lengths have been too short. However, new long-read sequencing technologies can produce single reads of 10 000 nt or more that can span such repeat expansions, although these long reads have high error rates of 10-20%, which complicate the detection of repetitive elements. Moreover, most traditional algorithms for finding tandem repeats are designed to find short tandem repeats (< 1000 nt) and cannot effectively handle the high error rate of long reads in a reasonable amount of time.
Results: Here, we report an efficient algorithm for solving this problem that takes advantage of the length of the repeat. Namely, a long tandem repeat has hundreds or thousands of approximate copies of the repeated unit, so despite the error rate, many short k-mers will be error-free in many copies of the unit. We exploited this characteristic to develop a method that first estimates regions that could contain a tandem repeat by analyzing the k-mer frequency distributions of fixed-size windows across the target read, and then assembles the k-mers of a putative region into the consensus repeat unit by greedily traversing a de Bruijn graph. Experimental results indicated that the proposed algorithm largely outperformed Tandem Repeats Finder, a widely used program for finding tandem repeats, in terms of sensitivity.
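The two-stage idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the k-mer size, the window size, and both thresholds are hypothetical choices made for illustration, and the error model assumes independent per-base errors. The sketch shows (1) why short k-mers often survive a high error rate, (2) flagging fixed-size windows whose k-mer counts are dominated by recurring k-mers, and (3) a greedy walk over a de Bruijn graph, with k-mers as nodes, that follows the most frequent successor until it returns to the starting k-mer.

```python
from collections import Counter


def error_free_fraction(error_rate, k):
    """Expected fraction of error-free k-mers, assuming independent per-base errors."""
    return (1.0 - error_rate) ** k


def kmer_counts(seq, k):
    """Count every length-k substring of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def repeat_windows(read, k=5, window=200, min_count=3, min_fraction=0.5):
    """Return start offsets of non-overlapping windows dominated by recurring k-mers.

    All default values are illustrative assumptions, not values from the paper.
    """
    starts = []
    for start in range(0, len(read) - window + 1, window):
        counts = kmer_counts(read[start:start + window], k)
        total = sum(counts.values())
        # k-mer occurrences that belong to k-mers seen at least min_count times
        recurring = sum(c for c in counts.values() if c >= min_count)
        if total and recurring / total >= min_fraction:
            starts.append(start)
    return starts


def greedy_unit(region, k=5):
    """Greedily walk the de Bruijn graph of region and return a candidate repeat unit.

    Starts at the most frequent k-mer and always moves to the most frequent
    successor (overlap of k-1 bases). Returns None if the walk hits a dead end
    or revisits a node other than the start.
    """
    counts = kmer_counts(region, k)
    if not counts:
        return None
    start = max(counts, key=counts.get)
    path, node, seen = start, start, {start}
    while True:
        nxt = max((node[1:] + base for base in "ACGT"),
                  key=lambda kmer: counts.get(kmer, 0))
        if counts.get(nxt, 0) == 0:
            return None  # dead end: no observed successor
        if nxt == start:
            break  # cycle closed: one full unit has been traversed
        if nxt in seen:
            return None  # entered a cycle that does not include the start
        seen.add(nxt)
        path += nxt[-1]
        node = nxt
    # The path spells k + n - 1 bases for a cycle of n nodes; keep the first n.
    return path[:len(path) - (k - 1)]
```

Under this independence assumption, a 15% per-base error rate still leaves roughly 44% of 5-mers error-free, which is why counting k-mers across many copies of a unit remains informative. The unit returned by `greedy_unit` is a rotation of the true unit, since the walk may begin at any position in the cycle.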
Pages: 612-621
Number of pages: 10