Optimal Algorithms for Bounded Weighted Edit Distance

Cited by: 2
Authors
Cassis, Alejandro [1, 2]
Kociumaka, Tomasz [2]
Wellnitz, Philip [2]
Affiliations
[1] Saarland Univ, Saarland Informat Campus, Saarbrucken, Germany
[2] Max Planck Inst Informat, Saarland Informat Campus, Saarbrucken, Germany
Funding
European Research Council;
Keywords
edit distance; conditional lower bounds; string algorithms;
DOI
10.1109/FOCS57990.2023.00135
Chinese Library Classification number
TP301 [Theory, Methods];
Subject classification code
081202;
Abstract
The edit distance (also known as Levenshtein distance) of two strings is the minimum number of insertions, deletions, and substitutions of characters needed to transform one string into the other. The textbook dynamic-programming algorithm computes the edit distance of two length-$n$ strings in $O(n^2)$ time, which is optimal up to subpolynomial factors assuming the Strong Exponential Time Hypothesis (SETH). An established way of circumventing this hardness is to consider the bounded setting, where the running time is parameterized by the edit distance $k$. A celebrated algorithm by Landau and Vishkin (JCSS'88) achieves a running time of $O(n + k^2)$, which is optimal as a function of $n$ and $k$ (again, up to subpolynomial factors and assuming SETH). While the theory community thoroughly studied the Levenshtein distance, most practical applications rely on a more general weighted edit distance, where each edit has a weight depending on its type and the involved characters from the alphabet $\Sigma$. This is formalized through a weight function $w \colon (\Sigma \cup \{\varepsilon\}) \times (\Sigma \cup \{\varepsilon\}) \to \mathbb{R}$ normalized so that $w(a \mapsto a) = 0$ for $a \in \Sigma \cup \{\varepsilon\}$ and $w(a \mapsto b) \ge 1$ for $a, b \in \Sigma \cup \{\varepsilon\}$ with $a \ne b$; the goal is to find an alignment of the two strings minimizing the total weight of edits. The classic $O(n^2)$-time algorithm supports this setting seamlessly, but for many decades just a straightforward $O(nk)$-time solution was known for the bounded version of the weighted edit distance problem. Only very recently, Das, Gilbert, Hajiaghayi, Kociumaka, and Saha (STOC'23) gave the first non-trivial algorithm, achieving a time complexity of $O(n + k^5)$. While this running time is linear for $k \le n^{1/5}$, it is still very far from $O(n + k^2)$, the bound achievable in the unweighted setting.
This is unsatisfactory, especially given the lack of any compelling evidence that the weighted version is inherently harder. In this paper, we essentially close this gap by showing both an improved $\tilde{O}(n + \sqrt{nk^3})$-time algorithm and, more surprisingly, a matching lower bound: Conditioned on the All-Pairs Shortest Paths (APSP) hypothesis, the running time of our solution is optimal for $\sqrt{n} \le k \le n$ (up to subpolynomial factors). In particular, this is the first separation between the complexity of the weighted and unweighted edit distance problems. Just like the Landau-Vishkin algorithm, our algorithm can be adapted to a wide variety of settings, such as when the input is given in a compressed representation. This is because, independently of the string length $n$, our procedure takes $\tilde{O}(k^3)$ time assuming that the equality of any two substrings can be tested in $\tilde{O}(1)$ time. Consistently with the previous work, our algorithm relies on the observation that strings with a rich structure of low-weight alignments must contain highly repetitive substrings. Nevertheless, achieving the optimal running time requires multiple new insights. We capture the right notion of repetitiveness using a tailor-made compressibility measure that we call self-edit distance. Our divide-and-conquer algorithm reduces the computation of weighted edit distance to several subproblems involving substrings of small self-edit distance and, at the same time, distributes the budget for edit weights among these subproblems. We then exploit the repetitive structure of the underlying substrings using state-of-the-art results for multiple-source shortest paths in planar graphs (Klein, SODA'05). As a stepping stone for our conditional lower bound, we study a dynamic problem of maintaining two strings subject to updates (substitutions of characters) and weighted edit distance queries.
We significantly extend the construction of Abboud and Dahlgaard (FOCS'16), originally for dynamic shortest paths in planar graphs, to show that a sequence of $n$ updates and $q \le n$ queries cannot be handled much faster than in $O(n^2 \sqrt{q})$ time. We then compose the snapshots of the dynamic strings to derive hardness of the static problem in the bounded setting.
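For orientation, the two baselines that the abstract mentions (the classic $O(n^2)$-time dynamic program and the straightforward $O(nk)$-time bounded variant) can be sketched as follows. This is only an illustrative sketch, not the paper's algorithm: the names `EPS`, `unit_weight`, `weighted_edit_distance`, and `bounded_weighted_edit_distance` are hypothetical, and the empty string is assumed to stand in for $\varepsilon$. The banded variant relies on the normalization $w(a \mapsto b) \ge 1$ for $a \ne b$: an alignment of weight at most $k$ contains at most $k$ edits, so it never leaves the diagonals $|i - j| \le k$.

```python
from typing import Callable, Optional

# Assumption for this sketch: the empty string plays the role of epsilon.
EPS = ""
Weight = Callable[[str, str], float]


def unit_weight(a: str, b: str) -> float:
    """Levenshtein weights: 0 for a match, 1 for every edit."""
    return 0.0 if a == b else 1.0


def weighted_edit_distance(x: str, y: str, w: Weight) -> float:
    """Textbook O(|x| * |y|) dynamic program over the full table."""
    n, m = len(x), len(y)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + w(x[i - 1], EPS)  # delete x[i-1]
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + w(EPS, y[j - 1])  # insert y[j-1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j] + w(x[i - 1], EPS),           # deletion
                dist[i][j - 1] + w(EPS, y[j - 1]),           # insertion
                dist[i - 1][j - 1] + w(x[i - 1], y[j - 1]),  # match or substitution
            )
    return dist[n][m]


def bounded_weighted_edit_distance(
    x: str, y: str, w: Weight, k: int
) -> Optional[float]:
    """Return the weighted edit distance if it is at most k, else None.

    Only cells with |i - j| <= k are filled, giving O(n * k) time.
    Assumes w(a, a) == 0 and w(a, b) >= 1 for a != b.
    """
    n, m = len(x), len(y)
    if abs(n - m) > k:
        return None  # more than k insertions/deletions would be needed
    inf = float("inf")
    prev = {0: 0.0}  # row 0 of the table, restricted to the band
    for j in range(1, min(m, k) + 1):
        prev[j] = prev[j - 1] + w(EPS, y[j - 1])
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            best = inf
            if j - 1 in cur:   # insertion of y[j-1]
                best = min(best, cur[j - 1] + w(EPS, y[j - 1]))
            if j in prev:      # deletion of x[i-1]
                best = min(best, prev[j] + w(x[i - 1], EPS))
            if j - 1 in prev:  # match or substitution
                best = min(best, prev[j - 1] + w(x[i - 1], y[j - 1]))
            cur[j] = best
        prev = cur
    d = prev.get(m, inf)
    return d if d <= k else None
```

With `unit_weight`, `weighted_edit_distance("kitten", "sitting", unit_weight)` evaluates to `3.0`, and the bounded variant returns the same value for `k = 3` and `None` for `k = 2`.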
Pages: 2177-2187
Page count: 11