MinSearch: An Efficient Algorithm for Similarity Search under Edit Distance

Cited by: 6
Authors
Zhang, Haoyu [1 ]
Zhang, Qin [1 ]
Affiliations
[1] Indiana Univ Bloomington, Bloomington, IN 47405 USA
Keywords
JOINS
DOI
10.1145/3394486.3403099
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We study a fundamental problem in data analytics: similarity search under edit distance (or edit similarity search for short). In this problem we try to build an index on a set of n strings S = {s_1, ..., s_n}, with the goal of answering the following two types of queries: (1) the threshold query: given a query string t and a threshold K, output all s_i ∈ S such that the edit distance between s_i and t is at most K; (2) the top-k query: given a query string t, output the k strings in S that are closest to t in terms of edit distance. Edit similarity search has numerous applications in bioinformatics, databases, data mining, information retrieval, etc., and has been studied extensively in the literature. In this paper we propose a novel algorithm for edit similarity search named MinSearch. The algorithm is randomized, and we can show mathematically that it outputs the correct answer with high probability for both types of queries. We have conducted an extensive set of experiments on MinSearch and compared it with the best existing algorithms for edit similarity search. Our experiments show that MinSearch has a clear advantage (often by orders of magnitude) over the best previous algorithms in query time, and that MinSearch is always among the best of all competitors in indexing time and space usage. Finally, MinSearch achieves perfect accuracy for both types of queries on all datasets that we have tested.
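To make the two query types in the abstract concrete, the following is a minimal brute-force sketch in Python. It is not MinSearch and builds no index: it simply scans all strings and computes the exact edit distance to each one, which is the baseline that an index-based method aims to beat. The function names (`edit_distance`, `threshold_query`, `top_k_query`) and the sample strings are illustrative assumptions, not taken from the paper.

```python
import heapq


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def threshold_query(strings, t, K):
    """Return every s in strings with edit_distance(s, t) <= K (linear scan)."""
    return [s for s in strings if edit_distance(s, t) <= K]


def top_k_query(strings, t, k):
    """Return the k strings closest to t under edit distance (linear scan)."""
    return heapq.nsmallest(k, strings, key=lambda s: edit_distance(s, t))


# Hypothetical sample data, for illustration only.
S = ["kitten", "sitting", "mitten", "bitten", "banana"]
close = threshold_query(S, "kitten", 1)
nearest = top_k_query(S, "kitten", 2)
```

Each query here costs one full edit distance computation per stored string, so the running time grows linearly with n; this is the cost that an index is meant to avoid.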
Pages: 566-576
Number of pages: 11
Related papers
50 in total
  • [1] Toward Efficient Similarity Search under Edit Distance on Hybrid Architectures
    Khalid, Madiha
    Yousaf, Muhammad Murtaza
    Sadiq, Muhammad Umair
    [J]. INFORMATION, 2022, 13 (10)
  • [2] A Novel Similarity Verification Algorithm Under Edit Distance Limitation
    Yu C.-Y.
    Li M.-M.
    Zhao C.
    Ma H.-T.
    [J]. Dongbei Daxue Xuebao/Journal of Northeastern University, 2019, 40 (11): 1543 - 1548
  • [3] Fast Similarity Search for Graphs by Edit Distance
    Rachkovskij, D. A.
    [J]. CYBERNETICS AND SYSTEMS ANALYSIS, 2019, 55 (06) : 1039 - 1051
  • [4] Fast Similarity Search for Graphs by Edit Distance
    D. A. Rachkovskij
    [J]. Cybernetics and Systems Analysis, 2019, 55 : 1039 - 1051
  • [5] siEDM: An Efficient String Index and Search Algorithm for Edit Distance with Moves
    Takabatake, Yoshimasa
    Nakashima, Kenta
    Kuboyama, Tetsuji
    Tabei, Yasuo
    Sakamoto, Hiroshi
    [J]. ALGORITHMS, 2016, 9 (02)
  • [6] Ed-Join: An Efficient Algorithm for Similarity Joins With Edit Distance Constraints
    Xiao, Chuan
    Wang, Wei
    Lin, Xuemin
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2008, 1 (01): 933 - 944
  • [7] MinJoin++: a fast algorithm for string similarity joins under edit distance
    Nikolai Karpov
    Haoyu Zhang
    Qin Zhang
    [J]. The VLDB Journal, 2024, 33 : 281 - 299
  • [8] Fast Subtrajectory Similarity Search in Road Networks under Weighted Edit Distance Constraints
    Koide, Satoshi
    Xiao, Chuan
    Ishikawa, Yoshiharu
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2020, 13 (11): 2188 - 2201
  • [9] HGED: A Hybrid Search Algorithm for Efficient Parallel Graph Edit Distance Computation
    Kim, Jongik
    [J]. IEEE ACCESS, 2020, 8 : 175776 - 175787
  • [10] Efficient Graph Similarity Joins with Edit Distance Constraints
    Zhao, Xiang
    Xiao, Chuan
    Lin, Xuemin
    Wang, Wei
    [J]. 2012 IEEE 28TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2012: 834 - 845