minIL: A Simple and Small Index for String Similarity Search with Edit Distance

Times cited: 0
Authors
Yang, Zhong [1]
Zheng, Bolong [1]
Wang, Xianzhi [2]
Li, Guohui [1]
Zhou, Xiaofang [3]
Affiliations
[1] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
[2] Univ Technol Sydney, Sydney, NSW, Australia
[3] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
Keywords
threshold string similarity search; edit distance; inverted index; minhash; EFFICIENT ALGORITHM; JOIN
DOI
10.1109/ICDE53745.2022.00047
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
String similarity search is a core functionality in a range of applications, including data cleaning, near-duplicate object detection, and data integration. We study the problem of threshold similarity search with the edit distance: given a set of strings, a threshold k, and a query string q, we aim to find all strings in the set whose edit distance to q is no larger than k. The problem has been studied extensively. However, existing methods consume a huge amount of space while achieving only acceptable efficiency, especially for long strings. In this paper, we propose a simple yet small index, called minIL, to address this issue. First, we adopt a minhash family to capture pivot characters and to construct sketch representations for strings. Second, we develop a multi-level inverted index to search sketches with low space consumption. Finally, we apply a novel learned index technique on top of the index that further improves query efficiency. Extensive experiments on real-world datasets offer insight into the performance of our method and show that it substantially reduces the index size and is capable of outperforming the baseline approaches.
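
To make the problem statement in the abstract concrete, the following is a minimal brute-force sketch of threshold similarity search under edit distance. It is not minIL and uses no index, sketches, or learned components; it only illustrates the query semantics (return every string whose edit distance to q is at most k). The function names `edit_distance` and `threshold_search` are hypothetical and chosen here for illustration.

```python
from typing import Iterable, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insert, delete, substitute), using two rolling rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def threshold_search(strings: Iterable[str], q: str, k: int) -> List[str]:
    """Return all strings whose edit distance to q is no larger than k."""
    results = []
    for s in strings:
        # Length filter: the edit distance is at least the length difference.
        if abs(len(s) - len(q)) > k:
            continue
        if edit_distance(s, q) <= k:
            results.append(s)
    return results


if __name__ == "__main__":
    data = ["kitten", "sitting", "mitten", "bitten", "fitting"]
    print(threshold_search(data, "kitten", 1))  # ['kitten', 'mitten', 'bitten']
```

This baseline compares the query against every string, which is the cost an index such as the one described in the abstract is designed to avoid.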
Pages: 565-577
Page count: 13