Efficient Online String Matching Based on Characters Distance Text Sampling

Cited by: 1

Authors:

Faro, Simone [1]

Marino, Francesco Pio [1]

Pavone, Arianna [2]

Affiliations:

[1] Univ Catania, Dipartimento Matemat & Informat, Viale A Doria 6, I-95125 Catania, Italy

[2] Univ Messina, Dipartimento Sci Cognit, Via Concez 6, I-98122 Messina, Italy

Source:

ALGORITHMICA | 2020 / Vol. 82 / Issue 11

Keywords:

String matching; Text processing; Efficient searching; Text indexing;

DOI:

10.1007/s00453-020-00732-4

Chinese Library Classification number:

TP31 [Computer software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

Searching for all occurrences of a pattern in a text is a fundamental problem in computer science with applications in many other fields, such as natural language processing, information retrieval and computational biology. Sampled string matching is an efficient approach recently introduced in order to overcome the prohibitive space requirements of an index construction, on the one hand, and drastically reduce searching time for the online solutions, on the other hand. In this paper we present a new algorithm for the sampled string matching problem, based on a characters distance sampling approach. The main idea is to sample the distances between consecutive occurrences of a given pivot character and then to search the sampled data online for any occurrence of the sampled pattern, before verifying the original text. From a theoretical point of view we prove that, under suitable conditions, our solution can achieve both linear worst-case time complexity and optimal average-time complexity. From a practical point of view it turns out that our solution shows a sub-linear behaviour in practice and speeds up online searching by a factor of up to 9, using limited additional space whose amount goes from 11% to 2.8% of the text size, with a gain of up to 50% compared with previous solutions.
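The sampling idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`pivot_positions`, `distances`, `sampled_search`) are hypothetical, the sampled data is scanned with a naive window comparison rather than an efficient online matcher, and the fallback for patterns that do not contain the pivot is an assumption made here so the example stays complete.

```python
from typing import List


def pivot_positions(s: str, pivot: str) -> List[int]:
    """Positions at which the pivot character occurs in s."""
    return [i for i, ch in enumerate(s) if ch == pivot]


def distances(positions: List[int]) -> List[int]:
    """Gaps between consecutive pivot positions (the sampled data)."""
    return [b - a for a, b in zip(positions, positions[1:])]


def sampled_search(text: str, pattern: str, pivot: str) -> List[int]:
    """Return all start positions of pattern in text, in increasing order."""
    m = len(pattern)
    pat_pos = pivot_positions(pattern, pivot)
    if not pat_pos:
        # Assumption of this sketch: with no pivot in the pattern the
        # sample gives no filter, so fall back to a plain scan.
        return [i for i in range(len(text) - m + 1)
                if text.startswith(pattern, i)]

    text_pos = pivot_positions(text, pivot)
    text_dist = distances(text_pos)
    pat_dist = distances(pat_pos)
    k = len(pat_dist)
    first = pat_pos[0]  # offset of the first pivot inside the pattern

    hits = []
    # Window j covers the k distances that start at the j-th pivot
    # occurrence of the text. With k == 0 every pivot is a candidate.
    for j in range(len(text_pos) - k):
        if text_dist[j:j + k] == pat_dist:
            start = text_pos[j] - first
            # Verification step against the original text.
            if start >= 0 and text.startswith(pattern, start):
                hits.append(start)
    return hits


if __name__ == "__main__":
    print(sampled_search("abracadabra abracadabra", "abra", "a"))
```

A matching distance window only yields a candidate, because the sample ignores every non-pivot character; the final comparison against the original text is what confirms an occurrence.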


Pages: 3390 - 3412

Number of pages: 23

