An Efficient Algorithm for Finding All Pairs k-Mismatch Maximal Common Substrings

Cited by: 2
Authors
Thankachan, Sharma V. [1]
Chockalingam, Sriram P. [2]
Aluru, Srinivas [1]
Affiliations
[1] Georgia Inst Technol, Sch CSE, Atlanta, GA 30332 USA
[2] Indian Inst Technol, Dept CSE, Bombay, Maharashtra, India
Keywords
LINEAR-TIME CONSTRUCTION; SUFFIX-ARRAYS;
DOI
10.1007/978-3-319-38782-6_1
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Identifying long pairwise maximal common substrings among a large set of sequences is a frequently used construct in computational biology, with applications in DNA sequence clustering and assembly. Because sequencers make errors, algorithms that can accommodate a small number of differences are of particular interest, but obtaining provably efficient solutions for such problems has been elusive. In this paper, we present a provably efficient algorithm with an expected run time guarantee of O(N log^k N + occ), where occ is the output size, for the following problem: given a collection D = {S_1, S_2, ..., S_n} of n sequences of total length N, a length threshold, and a mismatch threshold k >= 0, report all k-mismatch maximal common substrings of length at least the length threshold over all pairs of sequences in D. In addition, we present a result showing the hardness of this problem.
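To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm of the paper, and it makes no attempt at the O(N log^k N + occ) bound. It assumes one reading of "k-mismatch maximal common substring": a pair of equal-length substrings whose Hamming distance is at most k and which cannot be extended by one character to the left or right without exceeding k mismatches. The function names `k_mismatch_maximal_pairs` and `all_pairs`, and the parameter name `min_len` for the length threshold (assumed to be at least 1), are hypothetical.

```python
from itertools import combinations


def k_mismatch_maximal_pairs(a, b, k, min_len):
    """Brute-force reference (assumed definition, see lead-in).

    Returns (x, y, length) triples such that a[x:x+length] and
    b[y:y+length] differ in at most k positions and cannot be
    extended on either side while staying within k mismatches.
    """
    hits = []
    for x in range(len(a)):
        for y in range(len(b)):
            # Extend to the right as far as possible with <= k mismatches.
            length = 0
            mismatches = 0
            while x + length < len(a) and y + length < len(b):
                if a[x + length] != b[y + length]:
                    if mismatches == k:
                        break  # the next character would be mismatch k+1
                    mismatches += 1
                length += 1
            # Left extension is blocked at a string boundary, or when the
            # preceding characters differ and the budget k is already used.
            at_left_edge = x == 0 or y == 0
            left_blocked = at_left_edge or (
                a[x - 1] != b[y - 1] and mismatches == k
            )
            if left_blocked and length >= min_len:
                hits.append((x, y, length))
    return hits


def all_pairs(seqs, k, min_len):
    """Apply the brute-force check to every unordered pair of sequences."""
    result = {}
    for (i, a), (j, b) in combinations(enumerate(seqs), 2):
        hits = k_mismatch_maximal_pairs(a, b, k, min_len)
        if hits:
            result[(i, j)] = hits
    return result


if __name__ == "__main__":
    print(all_pairs(["ACGT", "ACCT", "TTTT"], k=1, min_len=4))
```

Each pair of sequences is scanned in time roughly cubic in the sequence lengths, so the sketch is only useful as a correctness reference on small inputs.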
Pages: 3-14
Number of pages: 12