minIL: A Simple and Small Index for String Similarity Search with Edit Distance

Authors
Yang, Zhong [1]
Zheng, Bolong [1]
Wang, Xianzhi [2]
Li, Guohui [1]
Zhou, Xiaofang [3]
Affiliations
[1] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
[2] Univ Technol Sydney, Sydney, NSW, Australia
[3] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
Keywords
threshold string similarity search; edit distance; inverted index; minhash; EFFICIENT ALGORITHM; JOIN
DOI
10.1109/ICDE53745.2022.00047
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification code
081104; 0812; 0835; 1405
Abstract
String similarity search is a core functionality in a range of applications, including data cleaning, near-duplicate object detection, and data integration. We study the problem of threshold similarity search with the edit distance: given a set of strings, a threshold k, and a query string q, we aim to find all strings in the set whose edit distance to q is no larger than k. The threshold similarity search problem with the edit distance has been studied extensively. However, existing methods consume a huge amount of space while achieving only acceptable efficiency, especially for long strings. In this paper, we propose a simple yet small index, called minIL, to address this issue. First, we adopt a minhash family to capture pivot characters and to construct sketch representations for strings. Second, we develop a multi-level inverted index to search sketches with low space consumption. Finally, we apply a novel learned index technique on top of the index to further improve query efficiency. Extensive experiments on real-world datasets offer insight into the performance of our method and show that it substantially reduces the index size and is capable of outperforming the baseline approaches.
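The problem statement in the abstract can be illustrated with a minimal brute-force sketch. This is not the minIL index (the record gives no implementation details for it); it only shows what "all strings whose edit distance to q is no larger than k" means, assuming the standard Levenshtein distance with unit-cost insertions, deletions, and substitutions. The function names `edit_distance_within` and `threshold_search` are hypothetical and chosen for illustration.

```python
from typing import Iterable, List


def edit_distance_within(a: str, b: str, k: int) -> bool:
    """Return True if the Levenshtein distance between a and b is at most k."""
    # The length difference is a lower bound on the edit distance.
    if abs(len(a) - len(b)) > k:
        return False
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        # Row minima never decrease, so the final distance exceeds k as well.
        if min(curr) > k:
            return False
        prev = curr
    return prev[-1] <= k


def threshold_search(strings: Iterable[str], q: str, k: int) -> List[str]:
    """Linear scan: verify every string against the query (no index)."""
    return [s for s in strings if edit_distance_within(s, q, k)]


if __name__ == "__main__":
    data = ["kitten", "sitting", "mitten", "kitchen"]
    print(threshold_search(data, "kitten", 2))
```

A scan like this verifies every string in the set against the query, which is the cost that index structures for this problem are designed to avoid.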
Pages: 565-577
Page count: 13