Finding long tandem repeats in long noisy reads

Cited by: 3
Authors
Morishita, Shinichi [1]
Ichikawa, Kazuki [1]
Myers, Eugene W. [2, 3]
Affiliations
[1] Univ Tokyo, Grad Sch Frontier Sci, Dept Computat Biol & Med Sci, Chiba 2778562, Japan
[2] Max Planck Inst Mol Cell Biol & Genet, D-01307 Dresden, Saxony, Germany
[3] Ctr Syst Biol Dresden, D-01307 Dresden, Saxony, Germany
Keywords
FRAGILE-X; MYOTONIC-DYSTROPHY; HEXANUCLEOTIDE REPEAT; TRINUCLEOTIDE REPEAT; CTG REPEAT; EXPANSION; REGION; IDENTIFICATION; C9ORF72; FINDER
DOI
10.1093/bioinformatics/btaa865
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Long tandem repeat expansions of more than 1000 nt have been suggested to be associated with diseases, but remain largely unexplored in individual human genomes because read lengths have been too short. New long-read sequencing technologies can produce single reads of 10 000 nt or more that span such repeat expansions. However, these long reads have high error rates of 10-20%, which complicate the detection of repetitive elements. Moreover, most traditional algorithms for finding tandem repeats are designed for short tandem repeats (< 1000 nt) and cannot handle the high error rate of long reads in a reasonable amount of time.

Results: Here, we report an efficient algorithm for solving this problem that takes advantage of the length of the repeat. A long tandem repeat has hundreds or thousands of approximate copies of the repeated unit, so despite the error rate, many short k-mers will be error-free in many copies of the unit. We exploited this characteristic in a two-step method. The first step estimates regions that could contain a tandem repeat by analyzing the k-mer frequency distributions of fixed-size windows across the target read. The second step assembles the k-mers of a putative region into the consensus repeat unit by greedily traversing a de Bruijn graph. Experimental results indicated that the proposed algorithm was substantially more sensitive than Tandem Repeats Finder, a widely used program for finding tandem repeats.
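The two steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (kmer_counts, repeated_fraction, candidate_windows, greedy_unit), the scoring rule (fraction of k-mer occurrences whose k-mer appears at least twice in the window), and all parameter values are assumptions chosen for illustration only. The sketch returns some rotation of the repeat unit, since a cyclic unit has no canonical starting point.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every k-mer occurring in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def repeated_fraction(window_seq, k):
    """Fraction of k-mer occurrences whose k-mer appears at least twice."""
    counts = kmer_counts(window_seq, k)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for c in counts.values() if c >= 2) / total


def candidate_windows(read, k, window, step, threshold):
    """Return (start, end) for fixed-size windows that look repetitive.

    The threshold is a hypothetical cutoff, not a value from the paper.
    """
    hits = []
    last_start = max(len(read) - window, 0)
    for start in range(0, last_start + 1, step):
        end = min(start + window, len(read))
        if repeated_fraction(read[start:end], k) >= threshold:
            hits.append((start, end))
    return hits


def greedy_unit(region, k, max_unit_len=1000):
    """Greedily walk the de Bruijn graph of region to recover a repeat unit.

    Nodes are k-mers; at each step the walk moves to the most frequent
    successor k-mer. The walk stops when it returns to its start k-mer.
    Returns a rotation of the unit, or None if no cycle is closed.
    """
    counts = kmer_counts(region, k)
    if not counts:
        return None
    # Start from the most frequent k-mer; ties are broken alphabetically.
    start = max(counts, key=lambda km: (counts[km], km))
    node, visited, unit = start, {start}, []
    for _ in range(max_unit_len):
        successors = [node[1:] + base for base in "ACGT"]
        best = max(successors, key=lambda km: counts.get(km, 0))
        if counts.get(best, 0) == 0:
            return None  # dead end: no observed successor
        unit.append(best[-1])
        if best == start:
            return "".join(unit)  # cycle closed
        if best in visited:
            return None  # looped without passing through the start
        visited.add(best)
        node = best
    return None


if __name__ == "__main__":
    # Toy read: a non-repetitive flank followed by a (CAG)n repeat.
    read = "ACGTTGCAAGCT" + "CAG" * 8
    print(candidate_windows(read, k=4, window=12, step=12, threshold=0.5))
    # Toy region: copies of ACGGT with one substitution and one deletion.
    noisy = "ACGGT" * 10 + "ACTGT" + "ACGGT" * 10 + "ACGT" + "ACGGT" * 10
    print(greedy_unit(noisy, k=4))
```

In this sketch the erroneous k-mers introduced by the substitution and the deletion each occur once, while the true k-mers occur dozens of times, so the greedy walk follows the true cycle. Real reads with 10-20% error would need larger windows and careful choice of k, which the sketch does not attempt to tune.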
Pages: 612-621
Number of pages: 10