minIL: A Simple and Small Index for String Similarity Search with Edit Distance

Cited by: 0
Authors
Yang, Zhong [1 ]
Zheng, Bolong [1 ]
Wang, Xianzhi [2 ]
Li, Guohui [1 ]
Zhou, Xiaofang [3 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
[2] Univ Technol Sydney, Sydney, NSW, Australia
[3] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
Keywords
threshold string similarity search; edit distance; inverted index; minhash; EFFICIENT ALGORITHM; JOIN;
DOI
10.1109/ICDE53745.2022.00047
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
String similarity search is a core functionality in a range of applications, including data cleaning, near-duplicate object detection, and data integration. We study the problem of threshold similarity search with the edit distance: given a set of strings, a threshold k, and a query string q, we aim to find all strings in the set whose edit distance to q is no larger than k. Many approaches have been proposed for this problem. However, they consume a large amount of space while achieving only acceptable efficiency, especially for long strings. In this paper, we propose a simple yet small index, called minIL, to address this issue. First, we adopt a minhash family to capture pivot characters and to construct sketch representations for strings. Second, we develop a multi-level inverted index to search sketches with low space consumption. Finally, we apply a novel learned index technique on top of the index that further improves query efficiency. Extensive experiments on real-world datasets offer insight into the performance of our method and show that it substantially reduces the index size and is capable of outperforming the baseline approaches.
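The problem statement in the abstract can be illustrated with a brute-force baseline. The sketch below is not minIL and does not reproduce any part of the paper's index; it only shows what "all strings whose edit distance to q is no larger than k" means, assuming the standard Levenshtein distance (insert, delete, substitute, each at cost 1). The function names `edit_distance_within` and `threshold_search` are hypothetical and chosen for this example.

```python
from typing import Iterable, List


def edit_distance_within(a: str, b: str, k: int) -> bool:
    """Return True if the Levenshtein distance between a and b is at most k."""
    # A length gap larger than k already forces more than k edits.
    if abs(len(a) - len(b)) > k:
        return False
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        # Row minima never decrease, so the final value cannot drop back to k.
        if min(curr) > k:
            return False
        prev = curr
    return prev[-1] <= k


def threshold_search(strings: Iterable[str], q: str, k: int) -> List[str]:
    """Scan every string; an index would aim to avoid this full scan."""
    return [s for s in strings if edit_distance_within(s, q, k)]


if __name__ == "__main__":
    data = ["kitten", "sitting", "mitten", "bitter"]
    print(threshold_search(data, "kitten", 2))
```

Every query here verifies every string in the set, which is the cost that indexing approaches such as the one described in the abstract try to reduce.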
Pages: 565-577
Page count: 13
Related papers
50 in total
  • [41] The Edit Distance as a Measure of Perceived Rhythmic Similarity
    Post, Olaf
    Toussaint, Godfried
    [J]. EMPIRICAL MUSICOLOGY REVIEW, 2011, 6 (03): : 164 - 179
  • [42] Graph Similarity Using Tree Edit Distance
    Dwivedi, Shri Prakash
    Srivastava, Vishal
    Gupta, Umesh
    [J]. STRUCTURAL, SYNTACTIC, AND STATISTICAL PATTERN RECOGNITION, S+SSPR 2022, 2022, 13813 : 233 - 241
  • [43] Edit distance for a run-length-encoded string and an uncompressed string
    Liu, J. J.
    Huang, G. S.
    Wang, Y. L.
    Lee, R. C. T.
    [J]. INFORMATION PROCESSING LETTERS, 2007, 105 (01) : 12 - 16
  • [44] Distance-Based Index Structures for Fast Similarity Search
    Rachkovskij D.A.
    [J]. Cybernetics and Systems Analysis, 2017, 53 (04) : 636 - 658
  • [45] Online Pattern Matching for String Edit Distance with Moves
    Takabatake, Yoshimasa
    Tabei, Yasuo
    Sakamoto, Hiroshi
    [J]. STRING PROCESSING AND INFORMATION RETRIEVAL, SPIRE 2014, 2014, 8799 : 203 - 214
  • [46] Computing the Expected Edit Distance from a String to a PFA
    Calvo-Zaragoza, Jorge
    de la Higuera, Colin
    Oncina, Jose
    [J]. Implementation and Application of Automata, 2016, 9705 : 39 - 50
  • [47] Online signature verification based on string edit distance
    Riesen, Kaspar
    Schmidt, Roman
    [J]. INTERNATIONAL JOURNAL ON DOCUMENT ANALYSIS AND RECOGNITION, 2019, 22 (01) : 41 - 54
  • [48] An algorithm for string edit distance allowing substring reversals
    Arslan, Abdullah N.
    [J]. BIBE 2006: SIXTH IEEE SYMPOSIUM ON BIOINFORMATICS AND BIOENGINEERING, PROCEEDINGS, 2006, : 220 - +
  • [49] String edit distance, random walks and graph matching
    Robles-Kelly, A
    Hancock, ER
    [J]. INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2004, 18 (03) : 315 - 327
  • [50] Contour Regularity Extraction Based on String Edit Distance
    Salas, Jose Ignacio Abreu
    Ramon Rico-Juan, Juan
    [J]. PATTERN RECOGNITION AND IMAGE ANALYSIS, PROCEEDINGS, 2009, 5524 : 160 - +