Sequential and parallel algorithms for all-pair k-mismatch maximal common substrings
Authors:
Chockalingam, Sriram P. [1]
Thankachan, Sharma V. [3]
Aluru, Srinivas [1,2]
Affiliations:
[1] Georgia Inst Technol, Inst Data Engn & Sci, 756 W Peachtree St NW, 12th Floor, Atlanta, GA 30308 USA
[2] Georgia Inst Technol, Dept Computat Sci & Engn, 756 W Peachtree St NW, 13th Floor, Atlanta, GA 30308 USA
[3] Univ Cent Florida, Dept Comp Sci, Orlando, FL 32816 USA
Identifying long pairwise maximal common substrings among a large set of sequences is a frequently used construct in computational biology, with applications in DNA sequence clustering and assembly. Because sequencers make errors, algorithms that can accommodate a small number of differences are of particular interest. Formally, let D be a collection of n sequences of total length N, φ be a length threshold, and k be a mismatch threshold. The goal is to identify and report all k-mismatch maximal common substrings of length at least φ over all pairs of strings in D. Heuristics based on seed-and-extend style filtering techniques are often employed in such applications. However, such methods do not provide provably efficient run time guarantees. To address this, we present a sequential algorithm with an expected run time of O(N log^k N + occ), where occ is the output size. We then present a distributed memory parallel algorithm with an expected run time of O(((N/p) log N + occ) log^k N) using O(log^(k+1) N) expected rounds of global communication, under some realistic assumptions, where p is the number of processors. Finally, we demonstrate the performance and scalability of our algorithms using experiments on large high throughput sequencing data. (C) 2020 Elsevier Inc. All rights reserved.
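
The problem statement in the abstract can be made concrete with a small brute-force sketch. This is not the authors' algorithm: it is a naive baseline that scans every alignment diagonal of every pair, taking time quadratic in the sequence lengths, and it is meant only to illustrate what a k-mismatch maximal common substring of length at least φ is. The function names `k_mismatch_maximal_pairs` and `all_pairs`, and the `(i, j, length)` output format, are assumptions made for this illustration.

```python
from itertools import combinations


def k_mismatch_maximal_pairs(x, y, k, phi):
    """Naive baseline (hypothetical helper, not the paper's method).

    Returns sorted (i, j, length) triples such that x[i:i+length] and
    y[j:j+length] differ in at most k positions, cannot be extended to the
    left or right without exceeding k mismatches or leaving a string, and
    have length >= phi.
    """
    results = []
    # Diagonal d aligns x[i] with y[i - d].
    for d in range(-(len(y) - 1), len(x)):
        i0, j0 = max(d, 0), max(-d, 0)
        span = min(len(x) - i0, len(y) - j0)
        # Mismatch offsets on this diagonal, with sentinels at both ends.
        bounds = [-1]
        bounds += [t for t in range(span) if x[i0 + t] != y[j0 + t]]
        bounds.append(span)
        mismatches = len(bounds) - 2
        if mismatches <= k:
            # The whole diagonal fits within the mismatch budget.
            windows = [(0, span)]
        else:
            # Each maximal window holds exactly k mismatches and is
            # flanked by a mismatch (or a string end) on each side.
            windows = [(bounds[s] + 1, bounds[s + k + 1])
                       for s in range(mismatches - k + 1)]
        for start, end in windows:
            length = end - start
            if length >= max(phi, 1):
                results.append((i0 + start, j0 + start, length))
    return sorted(results)


def all_pairs(seqs, k, phi):
    """Apply the pairwise baseline to every pair of sequences in seqs."""
    out = {}
    for (a, x), (b, y) in combinations(enumerate(seqs), 2):
        hits = k_mismatch_maximal_pairs(x, y, k, phi)
        if hits:
            out[(a, b)] = hits
    return out
```

For example, `all_pairs(["ACGTACGT", "ACGAACGT"], k=1, phi=5)` reports the single full-length alignment, since the two strings differ in one position. The per-pair, per-diagonal scan is exactly the kind of cost that the expected O(N log^k N + occ) bound stated above is designed to avoid.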