An Efficient Algorithm for Finding All Pairs k-Mismatch Maximal Common Substrings

Cited by: 2
Authors
Thankachan, Sharma V. [1]
Chockalingam, Sriram P. [2]
Aluru, Srinivas [1]
Affiliations
[1] Georgia Inst Technol, Sch CSE, Atlanta, GA 30332 USA
[2] Indian Inst Technol, Dept CSE, Bombay, Maharashtra, India
Keywords
LINEAR-TIME CONSTRUCTION; SUFFIX-ARRAYS
DOI
10.1007/978-3-319-38782-6_1
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Identifying long pairwise maximal common substrings among a large set of sequences is a frequently used construct in computational biology, with applications in DNA sequence clustering and assembly. Due to errors made by sequencers, algorithms that can accommodate a small number of differences are of particular interest, but obtaining provably efficient solutions for such problems has been elusive. In this paper, we present a provably efficient algorithm with an expected run time guarantee of O(N log^k N + occ), where occ is the output size, for the following problem: given a collection D = {S_1, S_2, ..., S_n} of n sequences of total length N, a length threshold φ, and a mismatch threshold k ≥ 0, report all k-mismatch maximal common substrings of length at least φ over all pairs of sequences in D. In addition, we present a result showing the hardness of this problem.
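To make the problem statement concrete, here is a minimal brute-force sketch in Python. It is not the algorithm from the paper and makes no attempt at the stated run time; it only enumerates the output on small inputs. The function names, the `(start_in_a, start_in_b, length)` output format, and the exact reading of "maximal" (an aligned pair of substrings with at most k mismatches that cannot be extended by one position on either side without exceeding k or running off a sequence end) are assumptions made for illustration.

```python
from itertools import combinations


def k_mismatch_maximal_pairs(a, b, k, min_len):
    """Brute force over one pair of sequences (assumes min_len >= 1).

    Returns (start_in_a, start_in_b, length) triples under the assumed
    definition of maximality described in the lead-in.
    """
    results = []
    # Each diagonal fixes the offset between positions in a and in b.
    for offset in range(-(len(b) - 1), len(a)):
        i0, j0 = max(offset, 0), max(-offset, 0)
        span = min(len(a) - i0, len(b) - j0)
        # mism[t] is True where the aligned characters differ.
        mism = [a[i0 + t] != b[j0 + t] for t in range(span)]
        prefix = [0]
        for flag in mism:
            prefix.append(prefix[-1] + flag)
        for s in range(span):
            for e in range(s + min_len, span + 1):
                m = prefix[e] - prefix[s]  # mismatches in window [s, e)
                if m > k:
                    break  # longer windows only add mismatches
                # Extension is blocked at a sequence end, or when the
                # budget is used up and the next aligned pair differs.
                left_blocked = s == 0 or (m == k and mism[s - 1])
                right_blocked = e == span or (m == k and mism[e])
                if left_blocked and right_blocked:
                    results.append((i0 + s, j0 + s, e - s))
    return results


def all_pairs(seqs, k, min_len):
    """Apply the pairwise brute force to every pair in the collection."""
    out = {}
    for (x, sx), (y, sy) in combinations(enumerate(seqs), 2):
        hits = k_mismatch_maximal_pairs(sx, sy, k, min_len)
        if hits:
            out[(x, y)] = hits
    return out
```

For example, `k_mismatch_maximal_pairs("ABCDEFGH", "ABCXEFGH", 1, 4)` returns `[(0, 0, 8)]`: the two sequences differ only at position 3, so with one mismatch allowed the whole alignment is a single maximal match. With `k = 0` and `min_len = 3` the same inputs yield `[(0, 0, 3), (4, 4, 4)]`, the two exact matches on either side of the mismatch.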
Pages: 3-14
Number of pages: 12