Sequential and parallel algorithms for all-pair k-mismatch maximal common substrings

Authors
Chockalingam, Sriram P. [1]
Thankachan, Sharma V. [3]
Aluru, Srinivas [1, 2]
Affiliations
[1] Georgia Inst Technol, Inst Data Engn & Sci, 756 W Peachtree St NW,12th Floor, Atlanta, GA 30308 USA
[2] Georgia Inst Technol, Dept Computat Sci & Engn, 756 W Peachtree St NW,13th Floor, Atlanta, GA 30308 USA
[3] Univ Cent Florida, Dept Comp Sci, Orlando, FL 32816 USA
Funding
National Science Foundation (USA)
Keywords
Approximate sequence matching; String algorithms; Suffix trees; Hamming distance; Parallel algorithms
DOI
10.1016/j.jpdc.2020.05.018
Chinese Library Classification
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
Identifying long pairwise maximal common substrings among a large set of sequences is a frequently used construct in computational biology, with applications in DNA sequence clustering and assembly. Because sequencers make errors, algorithms that can accommodate a small number of differences are of particular interest. Formally, let D be a collection of n sequences of total length N, phi be a length threshold, and k be a mismatch threshold. The goal is to identify and report all k-mismatch maximal common substrings of length at least phi over all pairs of strings in D. Heuristics based on seed-and-extend style filtering techniques are often employed in such applications. However, such methods provide no provable run-time guarantees. To this end, we present a sequential algorithm with an expected run time of O(N log^k N + occ), where occ is the output size. We then present a distributed-memory parallel algorithm with an expected run time of O(((N/P) log N + occ) log^k N) using O(log^{k+1} N) expected rounds of global communication, under some realistic assumptions, where P is the number of processors. Finally, we demonstrate the performance and scalability of our algorithms using experiments on large high-throughput sequencing data. © 2020 Elsevier Inc. All rights reserved.
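To make the problem statement concrete, the following is a minimal brute-force sketch in Python, not the suffix-tree-based algorithm of the paper. It assumes that a k-mismatch common substring is a pair of equal-length substrings with Hamming distance at most k, and that such a pair is maximal when it cannot be extended by one character on either side without exceeding k mismatches or running past the end of a string. The function names `kmismatch_maximal_pairs` and `all_pairs` are hypothetical. The sketch scans every alignment diagonal of every pair, so it takes time quadratic in the sequence lengths and is useful only for checking small examples.

```python
from itertools import combinations


def kmismatch_maximal_pairs(x, y, k, phi):
    """Return (start_in_x, start_in_y, length) for each maximal common
    substring of x and y with at most k mismatches and length >= phi."""
    results = []
    # Diagonal d aligns x[i] with y[i + d].
    for d in range(-(len(x) - 1), len(y)):
        i0, j0 = (-d, 0) if d < 0 else (0, d)
        span = min(len(x) - i0, len(y) - j0)
        # Mismatch offsets along the diagonal, with a sentinel at each end.
        marks = [-1] + [t for t in range(span) if x[i0 + t] != y[j0 + t]] + [span]
        if len(marks) - 2 <= k:
            # At most k mismatches on the whole diagonal: one maximal window.
            windows = [(0, span)]
        else:
            # Each maximal window lies strictly between marks a and a + k + 1.
            windows = [
                (marks[a] + 1, marks[a + k + 1] - marks[a] - 1)
                for a in range(len(marks) - k - 1)
            ]
        for start, length in windows:
            if length >= phi:
                results.append((i0 + start, j0 + start, length))
    return results


def all_pairs(seqs, k, phi):
    """Apply the pairwise routine to every unordered pair of sequences."""
    out = {}
    for (a, x), (b, y) in combinations(enumerate(seqs), 2):
        hits = kmismatch_maximal_pairs(x, y, k, phi)
        if hits:
            out[(a, b)] = hits
    return out
```

For example, `kmismatch_maximal_pairs("ACGTACGT", "ACGAACGT", 1, 4)` reports the full-length alignment, which contains a single mismatch, along with two shifted alignments of length 4.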
Pages: 68-79
Number of pages: 12
Related papers
32 in total
  • [1] A Parallel Algorithm for Finding All Pairs k-Mismatch Maximal Common Substrings
    Chockalingam, Sriram P.
    Thankachan, Sharma V.
    Aluru, Srinivas
    SC '16: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS, 2016, : 784 - 794
  • [2] An Efficient Algorithm for Finding All Pairs k-Mismatch Maximal Common Substrings
    Thankachan, Sharma V.
    Chockalingam, Sriram P.
    Aluru, Srinivas
    BIOINFORMATICS RESEARCH AND APPLICATIONS, ISBRA 2016, 2016, 9683 : 3 - 14
  • [3] Phylogeny reconstruction based on the length distribution of k-mismatch common substrings
    Morgenstern, Burkhard
    Schoebel, Svenja
    Leimeister, Chris-Andre
    ALGORITHMS FOR MOLECULAR BIOLOGY, 2017, 12
  • [4] Phylogeny reconstruction based on the length distribution of k-mismatch common substrings
    Burkhard Morgenstern
    Svenja Schöbel
    Chris-André Leimeister
    Algorithms for Molecular Biology, 12
  • [5] Parallel Methods for Finding k-Mismatch Shortest Unique Substrings Using GPU
    Schultz, Daniel W.
    Xu, Bojian
    IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 2021, 18 (01) : 386 - 395
  • [6] An Ultra-Fast and Parallelizable Algorithm for Finding k-Mismatch Shortest Unique Substrings
    Allen, Daniel R.
    Thankachan, Sharma V.
    Xu, Bojian
    IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 2021, 18 (01) : 138 - 148
  • [7] A Parallel Method for All-Pair SimRank Similarity Computation
    Huang, Xuan
    Gao, Xingkun
    Tang, Jie
    Wu, Gangshan
    ALGORITHMS AND ARCHITECTURES FOR PARALLEL PROCESSING, ICA3PP 2018, PT I, 2018, 11334 : 593 - 607
  • [8] A Provably Efficient Algorithm for the k-Mismatch Average Common Substring Problem
    Thankachan, Sharma V.
    Apostolico, Alberto
    Aluru, Srinivas
    JOURNAL OF COMPUTATIONAL BIOLOGY, 2016, 23 (06) : 472 - 482
  • [10] Efficient algorithms for the longest common subsequence in k-length substrings
    Deorowicz, Sebastian
    Grabowski, Szymon
    INFORMATION PROCESSING LETTERS, 2014, 114 (11) : 634 - 638