Optimal Algorithms for Bounded Weighted Edit Distance

Cited by: 2
Authors
Cassis, Alejandro [1 ,2 ]
Kociumaka, Tomasz [2 ]
Wellnitz, Philip [2 ]
Affiliations
[1] Saarland University, Saarland Informatics Campus, Saarbrücken, Germany
[2] Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
Funding
European Research Council;
Keywords
edit distance; conditional lower bounds; string algorithms;
DOI
10.1109/FOCS57990.2023.00135
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202 ;
Abstract
The edit distance (also known as Levenshtein distance) of two strings is the minimum number of insertions, deletions, and substitutions of characters needed to transform one string into the other. The textbook dynamic-programming algorithm computes the edit distance of two length-$n$ strings in $O(n^2)$ time, which is optimal up to subpolynomial factors assuming the Strong Exponential Time Hypothesis (SETH). An established way of circumventing this hardness is to consider the bounded setting, where the running time is parameterized by the edit distance $k$. A celebrated algorithm by Landau and Vishkin (JCSS'88) achieves a running time of $O(n + k^2)$, which is optimal as a function of $n$ and $k$ (again, up to subpolynomial factors and assuming SETH). While the theory community thoroughly studied the Levenshtein distance, most practical applications rely on a more general weighted edit distance, where each edit has a weight depending on its type and the involved characters from the alphabet $\Sigma$. This is formalized through a weight function $w : (\Sigma \cup \{\varepsilon\}) \times (\Sigma \cup \{\varepsilon\}) \to \mathbb{R}$ normalized so that $w(a \mapsto a) = 0$ for $a \in \Sigma \cup \{\varepsilon\}$ and $w(a \mapsto b) \ge 1$ for $a, b \in \Sigma \cup \{\varepsilon\}$ with $a \ne b$; the goal is to find an alignment of the two strings minimizing the total weight of edits. The classic $O(n^2)$-time algorithm supports this setting seamlessly, but for many decades just a straightforward $O(nk)$-time solution was known for the bounded version of the weighted edit distance problem. Only very recently, Das, Gilbert, Hajiaghayi, Kociumaka, and Saha (STOC'23) gave the first non-trivial algorithm, achieving a time complexity of $O(n + k^5)$. While this running time is linear for $k \le n^{1/5}$, it is still very far from $O(n + k^2)$, the bound achievable in the unweighted setting.
This is unsatisfactory, especially given the lack of any compelling evidence that the weighted version is inherently harder. In this paper, we essentially close this gap by showing both an improved $\tilde{O}(n + \sqrt{nk^3})$-time algorithm and, more surprisingly, a matching lower bound: Conditioned on the All-Pairs Shortest Paths (APSP) hypothesis, the running time of our solution is optimal for $\sqrt{n} \le k \le n$ (up to subpolynomial factors). In particular, this is the first separation between the complexity of the weighted and unweighted edit distance problems. Just like the Landau-Vishkin algorithm, our algorithm can be adapted to a wide variety of settings, such as when the input is given in a compressed representation. This is because, independently of the string length $n$, our procedure takes $\tilde{O}(k^3)$ time assuming that the equality of any two substrings can be tested in $\tilde{O}(1)$ time. Consistently with the previous work, our algorithm relies on the observation that strings with a rich structure of low-weight alignments must contain highly repetitive substrings. Nevertheless, achieving the optimal running time requires multiple new insights. We capture the right notion of repetitiveness using a tailor-made compressibility measure that we call self-edit distance. Our divide-and-conquer algorithm reduces the computation of weighted edit distance to several subproblems involving substrings of small self-edit distance and, at the same time, distributes the budget for edit weights among these subproblems. We then exploit the repetitive structure of the underlying substrings using state-of-the-art results for multiple-source shortest paths in planar graphs (Klein, SODA'05). As a stepping stone for our conditional lower bound, we study a dynamic problem of maintaining two strings subject to updates (substitutions of characters) and weighted edit distance queries.
We significantly extend the construction of Abboud and Dahlgaard (FOCS'16), originally for dynamic shortest paths in planar graphs, to show that a sequence of $n$ updates and $q \le n$ queries cannot be handled much faster than in $O(n^2 \sqrt{q})$ time. We then compose the snapshots of the dynamic strings to derive hardness of the static problem in the bounded setting.
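For orientation, the two baselines that the abstract mentions (the classic $O(n^2)$-time dynamic program and the straightforward $O(nk)$-time bounded variant) can be sketched as follows. This is a minimal illustration and not the algorithm of the paper. The function names, the use of the empty string `EPS` to stand for $\varepsilon$, and the `example_weight` function are assumptions made for this sketch only. The banded variant relies on the normalization $w(a \mapsto b) \ge 1$: an alignment of total weight at most $k$ contains at most $\lfloor k \rfloor$ edits, so it never leaves the diagonals $|i - j| \le \lfloor k \rfloor$.

```python
from typing import Callable, Dict, Optional

EPS = ""  # assumption of this sketch: the empty string plays the role of epsilon
Weight = Callable[[str, str], float]
VOWELS = set("aeiou")


def example_weight(a: str, b: str) -> float:
    """Hypothetical normalized weights: w(a, a) = 0 and w(a, b) >= 1 otherwise."""
    if a == b:
        return 0.0
    if a == EPS or b == EPS:
        return 1.5  # insertion or deletion
    return 1.0 if a in VOWELS and b in VOWELS else 2.0  # substitution


def weighted_edit_distance(x: str, y: str, w: Weight) -> float:
    """Textbook dynamic program, O(|x| * |y|) time, two rows of memory."""
    m = len(y)
    prev = [0.0] * (m + 1)
    for j in range(1, m + 1):
        prev[j] = prev[j - 1] + w(EPS, y[j - 1])  # build a prefix of y by insertions
    for i in range(1, len(x) + 1):
        cur = [prev[0] + w(x[i - 1], EPS)] + [0.0] * m
        for j in range(1, m + 1):
            cur[j] = min(
                prev[j] + w(x[i - 1], EPS),           # delete x[i-1]
                cur[j - 1] + w(EPS, y[j - 1]),        # insert y[j-1]
                prev[j - 1] + w(x[i - 1], y[j - 1]),  # match or substitute
            )
        prev = cur
    return prev[m]


def bounded_weighted_edit_distance(
    x: str, y: str, w: Weight, k: float
) -> Optional[float]:
    """Banded dynamic program, O(|x| * k) time.

    Returns the weighted edit distance if it is at most k, and None otherwise.
    Only cells with |i - j| <= floor(k) are evaluated.
    """
    n, m = len(x), len(y)
    if abs(n - m) > k:
        return None  # at least |n - m| insertions/deletions, each of weight >= 1
    inf = float("inf")
    band = int(k)
    prev: Dict[int, float] = {0: 0.0}
    for j in range(1, min(m, band) + 1):
        prev[j] = prev[j - 1] + w(EPS, y[j - 1])
    for i in range(1, n + 1):
        cur: Dict[int, float] = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            best = prev.get(j, inf) + w(x[i - 1], EPS)  # delete x[i-1]
            if j > 0:
                best = min(
                    best,
                    cur.get(j - 1, inf) + w(EPS, y[j - 1]),        # insert y[j-1]
                    prev.get(j - 1, inf) + w(x[i - 1], y[j - 1]),  # match/substitute
                )
            cur[j] = best
        prev = cur
    d = prev.get(m, inf)
    return d if d <= k else None
```

A call such as `bounded_weighted_edit_distance("kitten", "sitting", example_weight, 5)` evaluates only a band of width proportional to the threshold. Cells outside the band are treated as unreachable, which is safe only because every edit weighs at least 1; without that normalization the band argument would not hold.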
Pages: 2177-2187
Page count: 11