MinSearch: An Efficient Algorithm for Similarity Search under Edit Distance

Cited by: 6
Authors
Zhang, Haoyu [1]
Zhang, Qin [1]
Affiliation
[1] Indiana Univ Bloomington, Bloomington, IN 47405 USA
Keywords
JOINS;
DOI
10.1145/3394486.3403099
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We study a fundamental problem in data analytics: similarity search under edit distance (or, edit similarity search for short). In this problem we try to build an index on a set of n strings S = {s_1, ..., s_n}, with the goal of answering the following two types of queries: (1) the threshold query: given a query string t and a threshold K, output all s_i ∈ S such that the edit distance between s_i and t is at most K; (2) the top-k query: given a query string t, output the k strings in S that are closest to t in terms of edit distance. Edit similarity search has numerous applications in bioinformatics, databases, data mining, information retrieval, etc., and has been studied extensively in the literature. In this paper we propose a novel algorithm for edit similarity search named MinSearch. The algorithm is randomized, and we can show mathematically that it outputs the correct answer with high probability for both types of queries. We have conducted an extensive set of experiments on MinSearch, and compared it with the best existing algorithms for edit similarity search. Our experiments show that MinSearch has a clear advantage (often by orders of magnitude) over the best previous algorithms in query time, and MinSearch is always one of the best among all competitors in indexing time and space usage. Finally, MinSearch achieves perfect accuracy for both types of queries on all datasets that we have tested.
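The two query types defined in the abstract can be made concrete with a brute-force baseline. The sketch below is not MinSearch and builds no index; it only illustrates what the threshold query and the top-k query are expected to return, by comparing the query string against every string in S with the standard dynamic-programming edit distance. The function names (edit_distance, threshold_query, top_k_query) and the sample strings are illustrative assumptions, not taken from the paper.

```python
import heapq


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    # Keep the shorter string as the inner dimension to save memory.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def threshold_query(strings, t, K):
    """Return every string whose edit distance to t is at most K (linear scan)."""
    return [s for s in strings if edit_distance(s, t) <= K]


def top_k_query(strings, t, k):
    """Return the k strings closest to t in edit distance (linear scan)."""
    return heapq.nsmallest(k, strings, key=lambda s: edit_distance(s, t))


if __name__ == "__main__":
    # Hypothetical toy data set, for illustration only.
    S = ["kitten", "sitting", "mitten", "bitten"]
    print(threshold_query(S, "kitten", 1))
    print(top_k_query(S, "kitten", 2))
```

A linear scan like this costs one full distance computation per stored string, which is the per-query cost an index for edit similarity search aims to avoid.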
Pages: 566 - 576
Number of pages: 11
Related papers
  • [11] VChunkJoin: An Efficient Algorithm for Edit Similarity Joins
    Wang, Wei
    Qin, Jianbin
    Xiao, Chuan
    Lin, Xuemin
    Shen, Heng Tao
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2013, 25 (08) : 1916 - 1929
  • [12] EFFICIENT STRING EDIT SIMILARITY JOIN ALGORITHM
    Gouda, Karam
    Rashad, Metwally
    [J]. COMPUTING AND INFORMATICS, 2017, 36 (03) : 683 - 704
  • [13] Edit Distance Based Similarity Search of Heterogeneous Information Networks
    Lu, Jianhua
    Lu, Ningyun
    Ma, Sipei
    Zhang, Baili
    [J]. DATABASE AND EXPERT SYSTEMS APPLICATIONS (DEXA 2018), PT II, 2018, 11030 : 195 - 202
  • [14] MinJoin++: a fast algorithm for string similarity joins under edit distance
    Karpov, Nikolai
    Zhang, Haoyu
    Zhang, Qin
    [J]. VLDB JOURNAL, 2024, 33 (02) : 281 - 299
  • [15] An efficient algorithm for graph edit distance computation
    Chen, Xiaoyang
    Huo, Hongwei
    Huan, Jun
    Vitter, Jeffrey Scott
    [J]. KNOWLEDGE-BASED SYSTEMS, 2019, 163 : 762 - 775
  • [16] A new algorithm for image similarity measure and graph edit distance
    Xiao, Bing
    Li, Jie
    Gao, Xin-Bo
    [J]. Tien Tzu Hsueh Pao/Acta Electronica Sinica, 2009, 37 (10) : 2205 - 2210
  • [17] Efficient processing of graph similarity queries with edit distance constraints
    Zhao, Xiang
    Xiao, Chuan
    Lin, Xuemin
    Wang, Wei
    Ishikawa, Yoshiharu
    [J]. VLDB JOURNAL, 2013, 22 (06) : 727 - 752
  • [18] Efficient processing of graph similarity queries with edit distance constraints
    Xiang Zhao
    Chuan Xiao
    Xuemin Lin
    Wei Wang
    Yoshiharu Ishikawa
    [J]. The VLDB Journal, 2013, 22 : 727 - 752
  • [19] A unified framework for string similarity search with edit-distance constraint
    Yu, Minghe
    Wang, Jin
    Li, Guoliang
    Zhang, Yong
    Deng, Dong
    Feng, Jianhua
    [J]. VLDB JOURNAL, 2017, 26 (02) : 249 - 274
  • [20] minIL: A Simple and Small Index for String Similarity Search with Edit Distance
    Yang, Zhong
    Zheng, Bolong
    Wang, Xianzhi
    Li, Guohui
    Zhou, Xiaofang
    [J]. 2022 IEEE 38TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2022), 2022 : 565 - 577