Finding long tandem repeats in long noisy reads

Cited by: 3
Authors
Morishita, Shinichi [1]
Ichikawa, Kazuki [1]
Myers, Eugene W. [2, 3]
Affiliations
[1] Univ Tokyo, Grad Sch Frontier Sci, Dept Computat Biol & Med Sci, Chiba 2778562, Japan
[2] Max Planck Inst Mol Cell Biol & Genet, D-01307 Dresden, Saxony, Germany
[3] Ctr Syst Biol Dresden, D-01307 Dresden, Saxony, Germany
Keywords
FRAGILE-X; MYOTONIC-DYSTROPHY; HEXANUCLEOTIDE REPEAT; TRINUCLEOTIDE REPEAT; CTG REPEAT; EXPANSION; REGION; IDENTIFICATION; C9ORF72; FINDER
DOI
10.1093/bioinformatics/btaa865
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Long tandem repeat expansions of more than 1000 nt have been suggested to be associated with diseases, but remain largely unexplored in individual human genomes because read lengths have been too short. However, new long-read sequencing technologies can produce single reads of 10 000 nt or more that can span such repeat expansions, although these long reads have high error rates, of 10-20%, which complicates the detection of repetitive elements. Moreover, most traditional algorithms for finding tandem repeats are designed to find short tandem repeats (< 1000 nt) and cannot effectively handle the high error rate of long reads in a reasonable amount of time.

Results: Here, we report an efficient algorithm for solving this problem that takes advantage of the length of the repeat. Namely, a long tandem repeat has hundreds or thousands of approximate copies of the repeated unit, so despite the error rate, many short k-mers will be error-free in many copies of the unit. We exploited this characteristic to develop a method for first estimating regions that could contain a tandem repeat, by analyzing the k-mer frequency distributions of fixed-size windows across the target read, followed by an algorithm that assembles the k-mers of a putative region into the consensus repeat unit by greedily traversing a de Bruijn graph. Experimental results indicated that the proposed algorithm largely outperformed Tandem Repeats Finder, a widely used program for finding tandem repeats, in terms of sensitivity.
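
The two stages described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`repeat_score`, `candidate_windows`, `greedy_unit`), the scoring rule (fraction of k-mer positions whose k-mer recurs in the window), and all parameter values are assumptions chosen for illustration. A real tool would also need to handle sequencing errors, which this sketch only tolerates in the sense that the heaviest edge is preferred at each step.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every k-mer occurring in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def repeat_score(window, k):
    """Fraction of k-mer positions whose k-mer occurs more than once (assumed score)."""
    counts = kmer_counts(window, k)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for c in counts.values() if c > 1) / total


def candidate_windows(read, k=8, window=200, step=100, min_score=0.5):
    """Return (start, end) of fixed-size windows whose k-mers are mostly repeated."""
    hits = []
    for start in range(0, max(len(read) - window, 0) + 1, step):
        chunk = read[start:start + window]
        if repeat_score(chunk, k) >= min_score:
            hits.append((start, start + len(chunk)))
    return hits


def greedy_unit(region, k=8):
    """Greedy walk on a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers.

    Starts at the most frequent k-mer and always follows the heaviest unused
    outgoing edge until the walk returns to its first node. Returns the bases
    spelled by the cycle (a rotation of the unit), or "" if no cycle closes.
    """
    counts = kmer_counts(region, k)
    if not counts:
        return ""
    start_kmer = max(sorted(counts), key=counts.get)  # sorted() makes ties deterministic
    start_node = start_kmer[:-1]
    unit = [start_kmer[-1]]
    node = start_kmer[1:]
    used = {start_kmer}
    while node != start_node:
        options = [(counts[node + b], b) for b in "ACGT"
                   if node + b in counts and node + b not in used]
        if not options:
            return ""  # dead end: the greedy walk found no cycle
        _, base = max(options)
        used.add(node + base)
        unit.append(base)
        node = (node + base)[1:]
    return "".join(unit)
```

On an error-free repeat such as `"ACGGT" * 40`, `greedy_unit` returns a five-base rotation of `ACGGT`; which rotation comes out depends on where the walk starts.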
Pages: 612-621
Number of pages: 10