minIL: A Simple and Small Index for String Similarity Search with Edit Distance

Authors
Yang, Zhong [1]
Zheng, Bolong [1]
Wang, Xianzhi [2]
Li, Guohui [1]
Zhou, Xiaofang [3]
Affiliations
[1] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
[2] Univ Technol Sydney, Sydney, NSW, Australia
[3] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
Keywords
threshold string similarity search; edit distance; inverted index; minhash; EFFICIENT ALGORITHM; JOIN
DOI
10.1109/ICDE53745.2022.00047
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
String similarity search is a core functionality in a range of applications, including data cleaning, near-duplicate object detection, and data integration. We study the problem of threshold similarity search with the edit distance: given a set of strings, a threshold k, and a query string q, we aim to find all strings in the set whose edit distance to q is no larger than k. This problem has been studied extensively. However, existing methods consume a huge amount of space while achieving only acceptable efficiency, especially for long strings. In this paper, we propose a simple yet small index, called minIL, to eliminate this issue. First, we adopt a minhash family to capture pivot characters and to construct sketch representations for strings. Second, we develop a multi-level inverted index to search sketches with low space consumption. Finally, we apply a novel learned index technique on top of the index that further improves query efficiency. Extensive experiments on real-world datasets offer insight into the performance of our method and show that it substantially reduces the index size and is capable of outperforming the baseline approaches.
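The problem statement in the abstract can be made concrete with a brute-force baseline. The sketch below is not minIL: it is a plain linear scan with a standard dynamic-programming edit distance, a length filter, and early termination. The function names `edit_distance_within` and `threshold_search` are hypothetical, chosen only for illustration.

```python
def edit_distance_within(a: str, b: str, k: int) -> bool:
    """Return True if the Levenshtein distance between a and b is at most k."""
    # Length filter: each edit changes the length by at most one character.
    if abs(len(a) - len(b)) > k:
        return False
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        # Row minima never decrease, so the final distance must exceed k.
        if min(cur) > k:
            return False
        prev = cur
    return prev[-1] <= k


def threshold_search(strings, q, k):
    """Linear scan: every string whose edit distance to q is at most k."""
    return [s for s in strings if edit_distance_within(s, q, k)]


if __name__ == "__main__":
    data = ["kitten", "sitting", "mitten", "bitten", "kitchen"]
    print(threshold_search(data, "kitten", 1))
```

A scan like this verifies every string against the query, which is the cost that index-based methods aim to avoid by filtering candidates before verification.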
Pages: 565-577
Number of pages: 13
Related papers
50 in total
  • [31] Explaining Propagators for String Edit Distance Constraints
    Winter, Felix
    Musliu, Nysret
    Stuckey, Peter J.
    [J]. THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 1676 - 1683
  • [32] Fast Subtrajectory Similarity Search in Road Networks under Weighted Edit Distance Constraints
    Koide, Satoshi
    Xiao, Chuan
    Ishikawa, Yoshiharu
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2020, 13 (11): : 2188 - 2201
  • [33] Classes of cost functions for string edit distance
    S. V. Rice
    H. Bunke
    T. A. Nartker
    [J]. Algorithmica, 1997, 18 : 271 - 280
  • [34] The String Edit Distance Matching Problem With Moves
    Cormode, Graham
    Muthukrishnan, S.
    [J]. ACM TRANSACTIONS ON ALGORITHMS, 2007, 3 (01)
  • [35] The string edit distance matching problem with moves
    Cormode, G
    Muthukrishnan, S
    [J]. PROCEEDINGS OF THE THIRTEENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, 2002, : 667 - 676
  • [36] Oblivious String Embeddings and Edit Distance Approximations
    Batu, Tugkan
    Ergun, Funda
    Sahinalp, Cenk
    [J]. PROCEEDINGS OF THE SEVENTEENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, 2006, : 792 - 801
  • [37] Classes of cost functions for string edit distance
    Rice, SV
    Bunke, H
    Nartker, TA
    [J]. ALGORITHMICA, 1997, 18 (02) : 271 - 280
  • [38] A comparative analysis of the Tanimoto index and graph edit distance for measuring the topological similarity of trees
    Dehmer, Matthias
    Varmuza, Kurt
    [J]. APPLIED MATHEMATICS AND COMPUTATION, 2015, 259 : 242 - 250
  • [39] Similarity of DTDs Based on Edit Distance and Semantics
    Wojnar, Ales
    Mlynkova, Irena
    Dokulil, Jiri
    [J]. INTELLIGENT DISTRIBUTED COMPUTING, SYSTEMS AND APPLICATIONS, 2008, 162 : 207 - 216
  • [40] Chemical Similarity Based on Map Edit Distance
    Li, Xin
    Lyu, Xiaoqing
    Tang, Zhi
    Zhang, Hao
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM), 2019, : 1220 - 1222