Optimal Algorithms for Bounded Weighted Edit Distance

Cited by: 2

Authors:

Cassis, Alejandro [1,2]

Kociumaka, Tomasz [2]

Wellnitz, Philip [2]

Affiliations:

[1] Saarland Univ, Saarland Informat Campus, Saarbrucken, Germany

[2] Max Planck Inst Informat, Saarland Informat Campus, Saarbrucken, Germany

Source:

2023 IEEE 64TH ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE, FOCS | 2023

Funding:

European Research Council;

Keywords:

edit distance; conditional lower bounds; string algorithms;

DOI:

10.1109/FOCS57990.2023.00135

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202 ;

Abstract:

The edit distance (also known as Levenshtein distance) of two strings is the minimum number of insertions, deletions, and substitutions of characters needed to transform one string into the other. The textbook dynamic-programming algorithm computes the edit distance of two length-$n$ strings in $O(n^2)$ time, which is optimal up to subpolynomial factors assuming the Strong Exponential Time Hypothesis (SETH). An established way of circumventing this hardness is to consider the bounded setting, where the running time is parameterized by the edit distance $k$. A celebrated algorithm by Landau and Vishkin (JCSS'88) achieves a running time of $O(n + k^2)$, which is optimal as a function of $n$ and $k$ (again, up to subpolynomial factors and assuming SETH). While the theory community thoroughly studied the Levenshtein distance, most practical applications rely on a more general weighted edit distance, where each edit has a weight depending on its type and the involved characters from the alphabet $\Sigma$. This is formalized through a weight function $w : (\Sigma \cup \{\varepsilon\}) \times (\Sigma \cup \{\varepsilon\}) \to \mathbb{R}$ normalized so that $w(a \mapsto a) = 0$ for $a \in \Sigma \cup \{\varepsilon\}$ and $w(a \mapsto b) \ge 1$ for $a, b \in \Sigma \cup \{\varepsilon\}$ with $a \ne b$; the goal is to find an alignment of the two strings minimizing the total weight of edits. The classic $O(n^2)$-time algorithm supports this setting seamlessly, but for many decades just a straightforward $O(nk)$-time solution was known for the bounded version of the weighted edit distance problem. Only very recently, Das, Gilbert, Hajiaghayi, Kociumaka, and Saha (STOC'23) gave the first non-trivial algorithm, achieving a time complexity of $O(n + k^5)$. While this running time is linear for $k \le n^{1/5}$, it is still very far from $O(n + k^2)$, the bound achievable in the unweighted setting.
This is unsatisfactory, especially given the lack of any compelling evidence that the weighted version is inherently harder. In this paper, we essentially close this gap by showing both an improved $\tilde{O}(n + \sqrt{nk^3})$-time algorithm and, more surprisingly, a matching lower bound: Conditioned on the All-Pairs Shortest Paths (APSP) hypothesis, the running time of our solution is optimal for $\sqrt{n} \le k \le n$ (up to subpolynomial factors). In particular, this is the first separation between the complexity of the weighted and unweighted edit distance problems. Just like the Landau-Vishkin algorithm, our algorithm can be adapted to a wide variety of settings, such as when the input is given in a compressed representation. This is because, independently of the string length $n$, our procedure takes $\tilde{O}(k^3)$ time assuming that the equality of any two substrings can be tested in $\tilde{O}(1)$ time. Consistently with the previous work, our algorithm relies on the observation that strings with a rich structure of low-weight alignments must contain highly repetitive substrings. Nevertheless, achieving the optimal running time requires multiple new insights. We capture the right notion of repetitiveness using a tailor-made compressibility measure that we call self-edit distance. Our divide-and-conquer algorithm reduces the computation of weighted edit distance to several subproblems involving substrings of small self-edit distance and, at the same time, distributes the budget for edit weights among these subproblems. We then exploit the repetitive structure of the underlying substrings using state-of-the-art results for multiple-source shortest paths in planar graphs (Klein, SODA'05). As a stepping stone for our conditional lower bound, we study a dynamic problem of maintaining two strings subject to updates (substitutions of characters) and weighted edit distance queries.
We significantly extend the construction of Abboud and Dahlgaard (FOCS'16), originally for dynamic shortest paths in planar graphs, to show that a sequence of $n$ updates and $q \le n$ queries cannot be handled much faster than in $O(n^2 \sqrt{q})$ time. We then compose the snapshots of the dynamic strings to derive hardness of the static problem in the bounded setting.
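
For orientation only, the two baseline algorithms mentioned in the abstract can be sketched as follows. This is not the paper's algorithm. It is a minimal illustration of the textbook $O(n^2)$ dynamic program for weighted edit distance and of a banded variant in the spirit of the "straightforward $O(nk)$-time solution". The function names, the use of the empty string for $\varepsilon$, and `example_weight` are assumptions made for this sketch. The banded variant relies on the normalization $w(a \mapsto b) \ge 1$ for $a \ne b$: an alignment of weight at most $k$ then contains at most $k$ edits, so it never leaves the diagonals with $|i - j| \le k$.

```python
from typing import Callable, Optional

EPS = ""  # stands in for the empty symbol epsilon (an assumption of this sketch)
INF = float("inf")
Weight = Callable[[str, str], float]


def weighted_edit_distance(x: str, y: str, w: Weight) -> float:
    """Textbook dynamic program over all (len(x)+1) * (len(y)+1) cells."""
    m = len(y)
    prev = [0.0] * (m + 1)
    for j in range(1, m + 1):
        prev[j] = prev[j - 1] + w(EPS, y[j - 1])  # insert a prefix of y
    for i in range(1, len(x) + 1):
        cur = [prev[0] + w(x[i - 1], EPS)] + [0.0] * m
        for j in range(1, m + 1):
            cur[j] = min(
                prev[j] + w(x[i - 1], EPS),           # delete x[i-1]
                cur[j - 1] + w(EPS, y[j - 1]),        # insert y[j-1]
                prev[j - 1] + w(x[i - 1], y[j - 1]),  # match (weight 0) or substitute
            )
        prev = cur
    return prev[m]


def bounded_weighted_edit_distance(x: str, y: str, w: Weight, k: float) -> Optional[float]:
    """Return the weighted edit distance if it is at most k, else None.

    Only cells with |i - j| <= floor(k) are visited, so about (2k + 1) cells per row.
    """
    n, m = len(x), len(y)
    band = int(k)
    if abs(n - m) > band:
        return None
    prev = {0: 0.0}
    for j in range(1, min(m, band) + 1):
        prev[j] = prev[j - 1] + w(EPS, y[j - 1])
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            best = prev.get(j, INF) + w(x[i - 1], EPS)  # delete x[i-1]
            if j > 0:
                best = min(
                    best,
                    cur.get(j - 1, INF) + w(EPS, y[j - 1]),        # insert y[j-1]
                    prev.get(j - 1, INF) + w(x[i - 1], y[j - 1]),  # match or substitute
                )
            cur[j] = best
        prev = cur
    d = prev.get(m, INF)
    return d if d <= k else None


def example_weight(a: str, b: str) -> float:
    """Hypothetical normalized weights: substitutions cost 1, insertions and deletions 1.5."""
    if a == b:
        return 0.0
    return 1.5 if EPS in (a, b) else 1.0
```

With `example_weight`, calling `bounded_weighted_edit_distance("kitten", "sitting", example_weight, 4)` agrees with the full dynamic program, while a budget of `3` yields `None` because every alignment of the two strings weighs more than 3.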


Pages: 2177-2187

Number of pages: 11
