Efficient POSIX submatch extraction on nondeterministic finite automata

Cited by: 3
Authors
Borsotti, Angelo [1]
Trofimovich, Ulya [2]
Affiliations
[1] Polytech Univ Milan, Dept Elect Informat & Bioengn, Milan, Italy
[2] Belarusian State Univ, Dept Discrete Math & Algorithm, Minsk, Belarus
Source
SOFTWARE-PRACTICE & EXPERIENCE | 2021 / Vol. 51 / Issue 02
Keywords
finite-state automata; parsing; POSIX; regular expressions; submatch extraction
DOI
10.1002/spe.2881
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
In this paper we study the performance of POSIX submatch extraction algorithms based on nondeterministic finite automata (NFA). We propose an algorithm that combines Laurikari tagged NFA and extended Okui-Suzuki disambiguation. The algorithm works in worst-case $O(n m^2 t)$ time and $O(m^2)$ space (including preprocessing), where $n$ is the length of input, $m$ is the size of the regular expression with bounded repetition expanded, and $t$ is the number of capturing groups and subexpressions that contain them. On real-world benchmarks our algorithm performs close to the $O(n m t)$ complexity of leftmost-greedy matching, although on artificial benchmarks it can be significantly slower. We propose a lazy version of the algorithm that runs much faster, but requires $O(n m^2)$ space. We show that the Kuklewicz algorithm is slower in practice, and the backward matching algorithm proposed by Cox is incorrect.
Pages: 159-192
Page count: 34