Optimal Algorithms for Bounded Weighted Edit Distance

Cited by: 2
Authors
Cassis, Alejandro [1, 2]
Kociumaka, Tomasz [2]
Wellnitz, Philip [2]
Affiliations
[1] Saarland University, Saarland Informatics Campus, Saarbrücken, Germany
[2] Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
Funding
European Research Council;
Keywords
edit distance; conditional lower bounds; string algorithms;
DOI
10.1109/FOCS57990.2023.00135
Chinese Library Classification number
TP301 [Theory, Methods];
Subject classification code
081202;
Abstract
The edit distance (also known as Levenshtein distance) of two strings is the minimum number of insertions, deletions, and substitutions of characters needed to transform one string into the other. The textbook dynamic-programming algorithm computes the edit distance of two length-$n$ strings in $O(n^2)$ time, which is optimal up to subpolynomial factors assuming the Strong Exponential Time Hypothesis (SETH). An established way of circumventing this hardness is to consider the bounded setting, where the running time is parameterized by the edit distance $k$. A celebrated algorithm by Landau and Vishkin (JCSS'88) achieves a running time of $O(n + k^2)$, which is optimal as a function of $n$ and $k$ (again, up to subpolynomial factors and assuming SETH). While the theory community thoroughly studied the Levenshtein distance, most practical applications rely on a more general weighted edit distance, where each edit has a weight depending on its type and the involved characters from the alphabet $\Sigma$. This is formalized through a weight function $w : (\Sigma \cup \{\varepsilon\}) \times (\Sigma \cup \{\varepsilon\}) \to \mathbb{R}$ normalized so that $w(a \mapsto a) = 0$ for $a \in \Sigma \cup \{\varepsilon\}$ and $w(a \mapsto b) \ge 1$ for $a, b \in \Sigma \cup \{\varepsilon\}$ with $a \ne b$; the goal is to find an alignment of the two strings minimizing the total weight of edits. The classic $O(n^2)$-time algorithm supports this setting seamlessly, but for many decades just a straightforward $O(nk)$-time solution was known for the bounded version of the weighted edit distance problem. Only very recently, Das, Gilbert, Hajiaghayi, Kociumaka, and Saha (STOC'23) gave the first non-trivial algorithm, achieving a time complexity of $O(n + k^5)$. While this running time is linear for $k \le n^{1/5}$, it is still very far from $O(n + k^2)$, the bound achievable in the unweighted setting.
This is unsatisfactory, especially given the lack of any compelling evidence that the weighted version is inherently harder. In this paper, we essentially close this gap by showing both an improved $\tilde{O}(n + \sqrt{nk^3})$-time algorithm and, more surprisingly, a matching lower bound: Conditioned on the All-Pairs Shortest Paths (APSP) hypothesis, the running time of our solution is optimal for $\sqrt{n} \le k \le n$ (up to subpolynomial factors). In particular, this is the first separation between the complexity of the weighted and unweighted edit distance problems. Just like the Landau-Vishkin algorithm, our algorithm can be adapted to a wide variety of settings, such as when the input is given in a compressed representation. This is because, independently of the string length $n$, our procedure takes $\tilde{O}(k^3)$ time assuming that the equality of any two substrings can be tested in $\tilde{O}(1)$ time. Consistently with the previous work, our algorithm relies on the observation that strings with a rich structure of low-weight alignments must contain highly repetitive substrings. Nevertheless, achieving the optimal running time requires multiple new insights. We capture the right notion of repetitiveness using a tailor-made compressibility measure that we call self-edit distance. Our divide-and-conquer algorithm reduces the computation of weighted edit distance to several subproblems involving substrings of small self-edit distance and, at the same time, distributes the budget for edit weights among these subproblems. We then exploit the repetitive structure of the underlying substrings using state-of-the-art results for multiple-source shortest paths in planar graphs (Klein, SODA'05). As a stepping stone for our conditional lower bound, we study a dynamic problem of maintaining two strings subject to updates (substitutions of characters) and weighted edit distance queries.
We significantly extend the construction of Abboud and Dahlgaard (FOCS'16), originally for dynamic shortest paths in planar graphs, to show that a sequence of $n$ updates and $q \le n$ queries cannot be handled much faster than in $O(n^2 \sqrt{q})$ time. We then compose the snapshots of the dynamic strings to derive hardness of the static problem in the bounded setting.
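For orientation, the two baselines named in the abstract can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it shows the textbook quadratic dynamic program for weighted edit distance and the straightforward banded variant for the bounded setting. The names `EPS`, `weighted_edit_distance`, `bounded_weighted_edit_distance`, and `example_weight` are assumptions of this sketch, and the weight function is passed as an ordinary Python callable `w(a, b)` with the empty string standing in for epsilon. The banded variant relies only on the normalization stated in the abstract: since every edit weighs at least 1, an alignment of weight at most k uses at most floor(k) insertions and deletions, so it never leaves the diagonal band |i - j| <= floor(k).

```python
from math import floor, inf
from typing import Callable, Optional

EPS = ""  # stand-in for the empty string epsilon (assumption of this sketch)
Weight = Callable[[str, str], float]


def weighted_edit_distance(x: str, y: str, w: Weight) -> float:
    """Textbook O(|x| * |y|)-time dynamic program, keeping one row at a time."""
    n, m = len(x), len(y)
    prev = [0.0] * (m + 1)
    for j in range(1, m + 1):
        prev[j] = prev[j - 1] + w(EPS, y[j - 1])  # insert y[j-1]
    for i in range(1, n + 1):
        cur = [prev[0] + w(x[i - 1], EPS)] + [inf] * m
        for j in range(1, m + 1):
            cur[j] = min(
                prev[j] + w(x[i - 1], EPS),           # delete x[i-1]
                cur[j - 1] + w(EPS, y[j - 1]),        # insert y[j-1]
                prev[j - 1] + w(x[i - 1], y[j - 1]),  # substitute, or match at cost 0
            )
        prev = cur
    return prev[m]


def bounded_weighted_edit_distance(
    x: str, y: str, w: Weight, k: float
) -> Optional[float]:
    """Return the weighted edit distance if it is at most k, else None.

    Only cells with |i - j| <= floor(k) are filled, giving O(|x| * k) time.
    Assumes w(a, a) = 0 and w(a, b) >= 1 for a != b.
    """
    n, m = len(x), len(y)
    band = floor(k)
    if band < 0 or abs(n - m) > band:
        return None
    prev = {0: 0.0}  # row 0 of the table, restricted to the band
    for j in range(1, min(m, band) + 1):
        prev[j] = prev[j - 1] + w(EPS, y[j - 1])
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            best = prev.get(j, inf) + w(x[i - 1], EPS)  # delete x[i-1]
            if j > 0:
                best = min(
                    best,
                    cur.get(j - 1, inf) + w(EPS, y[j - 1]),        # insert
                    prev.get(j - 1, inf) + w(x[i - 1], y[j - 1]),  # substitute/match
                )
            cur[j] = best
        prev = cur
    d = prev.get(m, inf)
    return d if d <= k else None


def example_weight(a: str, b: str) -> float:
    """Hypothetical weights: matches 0, substitutions 1, insertions/deletions 1.5."""
    if a == b:
        return 0.0
    return 1.5 if EPS in (a, b) else 1.0
```

With the hypothetical `example_weight`, `bounded_weighted_edit_distance("kitten", "sitting", example_weight, 4)` agrees with the unbounded routine and returns 3.5 (two substitutions and one insertion), while a budget of `k = 3` yields `None`.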
Pages: 2177-2187
Page count: 11