Finding long tandem repeats in long noisy reads

Cited by: 3

Authors:

Morishita, Shinichi ^{[1]}

Ichikawa, Kazuki ^{[1]}

Myers, Eugene W. ^{[2,3]}

Affiliations:

[1] Univ Tokyo, Grad Sch Frontier Sci, Dept Computat Biol & Med Sci, Chiba 2778562, Japan

[2] Max Planck Inst Mol Cell Biol & Genet, D-01307 Dresden, Saxony, Germany

[3] Ctr Syst Biol Dresden, D-01307 Dresden, Saxony, Germany

Source:

BIOINFORMATICS | 2021 / Volume 37 / Issue 05

Keywords:

FRAGILE-X; MYOTONIC-DYSTROPHY; HEXANUCLEOTIDE REPEAT; TRINUCLEOTIDE REPEAT; CTG REPEAT; EXPANSION; REGION; IDENTIFICATION; C9ORF72; FINDER;

DOI:

10.1093/bioinformatics/btaa865

Chinese Library Classification number:

Q5 [Biochemistry];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

Motivation: Long tandem repeat expansions of more than 1000 nt have been suggested to be associated with diseases, but remain largely unexplored in individual human genomes because read lengths have been too short. However, new long-read sequencing technologies can produce single reads of 10 000 nt or more that can span such repeat expansions, although these long reads have high error rates, of 10-20%, which complicates the detection of repetitive elements. Moreover, most traditional algorithms for finding tandem repeats are designed to find short tandem repeats (< 1000 nt) and cannot effectively handle the high error rate of long reads in a reasonable amount of time.

Results: Here, we report an efficient algorithm for solving this problem that takes advantage of the length of the repeat. Namely, a long tandem repeat has hundreds or thousands of approximate copies of the repeated unit, so despite the error rate, many short k-mers will be error-free in many copies of the unit. We exploited this characteristic to develop a method for first estimating regions that could contain a tandem repeat, by analyzing the k-mer frequency distributions of fixed-size windows across the target read, followed by an algorithm that assembles the k-mers of a putative region into the consensus repeat unit by greedily traversing a de Bruijn graph. Experimental results indicated that the proposed algorithm largely outperformed Tandem Repeats Finder, a widely used program for finding tandem repeats, in terms of sensitivity.
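The two stages described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the function names, the scoring rule (the fraction of k-mer occurrences in a window that belong to k-mers seen more than once), the window size, step and threshold are all assumptions chosen for illustration, and the graph is a simplified de Bruijn-style graph whose nodes are k-mers.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every k-mer occurring in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def repeat_score(window, k):
    """Fraction of k-mer occurrences that belong to k-mers seen more than once.

    Hypothetical score: close to 1 inside a tandem repeat, close to 0 in
    non-repetitive sequence when k is large enough.
    """
    counts = kmer_counts(window, k)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for c in counts.values() if c > 1) / total


def candidate_windows(read, k=8, window=200, step=100, threshold=0.5):
    """Return (start, end) of fixed-size windows whose score reaches threshold."""
    hits = []
    for start in range(0, max(len(read) - window, 0) + 1, step):
        chunk = read[start:start + window]
        if repeat_score(chunk, k) >= threshold:
            hits.append((start, start + len(chunk)))
    return hits


def greedy_unit(region, k):
    """Greedily walk k-mer nodes (overlap k-1) until the walk closes a cycle.

    Starts at the most frequent k-mer and always follows the most frequent
    successor. Returns the spelled repeat unit (possibly a rotation of the
    true unit), or "" if the walk dead-ends or loops without returning.
    """
    counts = kmer_counts(region, k)
    if not counts:
        return ""
    start = max(sorted(counts), key=counts.get)  # sorted() makes ties deterministic
    path, node, visited = start, start, {start}
    while True:
        successors = [node[1:] + base for base in "ACGT"]
        best = max(successors, key=lambda s: counts.get(s, 0))
        if counts.get(best, 0) == 0:
            return ""  # dead end: no observed successor
        if best == start:
            # Cycle closed: one character per visited node spells the unit.
            return path[:len(path) - (k - 1)]
        if best in visited:
            return ""  # looped back somewhere other than the start
        visited.add(best)
        path += best[-1]
        node = best
```

For example, `greedy_unit("ACGGTCA" * 30, 4)` walks a seven-node cycle and spells the unit back. Because errors mostly create low-count k-mers, the greedy choice of the most frequent successor tends to stay on the error-free cycle, which is the intuition the abstract gives for why long repeats tolerate noisy reads.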

Pages: 612-621

Page count: 10
