MinSearch: An Efficient Algorithm for Similarity Search under Edit Distance

Cited by: 6
Authors
Zhang, Haoyu [1]
Zhang, Qin [1]
Affiliations
[1] Indiana Univ Bloomington, Bloomington, IN 47405 USA
Keywords
JOINS;
DOI
10.1145/3394486.3403099
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We study a fundamental problem in data analytics: similarity search under edit distance (or, edit similarity search for short). In this problem we try to build an index on a set of n strings S = {s_1, ..., s_n}, with the goal of answering the following two types of queries: (1) the threshold query: given a query string t and a threshold K, output all s_i ∈ S such that the edit distance between s_i and t is at most K; (2) the top-k query: given a query string t, output the k strings in S that are closest to t in terms of edit distance. Edit similarity search has numerous applications in bioinformatics, databases, data mining, information retrieval, etc., and has been studied extensively in the literature. In this paper we propose a novel algorithm for edit similarity search named MinSearch. The algorithm is randomized, and we can show mathematically that it outputs the correct answer with high probability for both types of queries. We have conducted an extensive set of experiments on MinSearch, and compared it with the best existing algorithms for edit similarity search. Our experiments show that MinSearch has a clear advantage (often by orders of magnitude) over the best previous algorithms in query time, and MinSearch is always one of the best among all competitors in indexing time and space usage. Finally, MinSearch achieves perfect accuracy for both types of queries on all datasets that we have tested.
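To make the two query types in the abstract concrete, the following is a minimal brute-force sketch in Python. It is not MinSearch and builds no index: it scans every string and computes the exact edit distance with the textbook dynamic program. The function names `edit_distance`, `threshold_query`, and `top_k_query` are hypothetical, chosen here for illustration only.

```python
import heapq
from typing import List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Exact Levenshtein distance using a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def threshold_query(strings: Sequence[str], t: str, K: int) -> List[str]:
    """Brute-force baseline: all strings within edit distance K of t."""
    return [s for s in strings if edit_distance(s, t) <= K]


def top_k_query(strings: Sequence[str], t: str, k: int) -> List[str]:
    """Brute-force baseline: the k strings closest to t by edit distance."""
    return heapq.nsmallest(k, strings, key=lambda s: edit_distance(s, t))


if __name__ == "__main__":
    S = ["kitten", "sitting", "mitten", "bitten", "written"]
    print(threshold_query(S, "kitten", 1))
    print(top_k_query(S, "kitten", 2))
```

Each query in this baseline costs one full edit-distance computation per stored string, which is the per-query cost that an index for edit similarity search aims to avoid.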
Pages: 566 - 576
Number of pages: 11
Related papers
50 in total
  • [21] Graph Similarity Search with Edit Distance Constraint in Large Graph Databases
    Zheng, Weiguo
    Zou, Lei
    Lian, Xiang
    Wang, Dong
    Zhao, Dongyan
    [J]. PROCEEDINGS OF THE 22ND ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM'13), 2013, : 1595 - 1600
  • [22] A unified framework for string similarity search with edit-distance constraint
    Minghe Yu
    Jin Wang
    Guoliang Li
    Yong Zhang
    Dong Deng
    Jianhua Feng
    [J]. The VLDB Journal, 2017, 26 : 249 - 274
  • [23] Modified edit distance algorithm and its application in web search
    School of Electronics and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China
    [J]. Hsi An Chiao Tung Ta Hsueh, 2008, (12) : 1450 - 1454
  • [24] Top-k String Similarity Search with Edit-Distance Constraints
    Deng, Dong
    Li, Guoliang
    Feng, Jianhua
    Li, Wen-Syan
    [J]. 2013 IEEE 29TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2013, : 925 - 936
  • [25] Phrase similarity through the edit distance
    Vilares, M
    Ribadas, FJ
    Vilares, J
    [J]. DATABASE AND EXPERT SYSTEMS APPLICATIONS, PROCEEDINGS, 2004, 3180 : 306 - 317
  • [26] An Efficient Video Similarity Search Algorithm
    Cao, Zheng
    Zhu, Ming
    [J]. IEEE TRANSACTIONS ON CONSUMER ELECTRONICS, 2010, 56 (02) : 751 - 755
  • [27] FrepJoin: an efficient partition-based algorithm for edit similarity join
    Luo, Ji-zhou
    Shi, Sheng-fei
    Wang, Hong-zhi
    Li, Jian-zhong
    [J]. FRONTIERS OF INFORMATION TECHNOLOGY & ELECTRONIC ENGINEERING, 2017, 18 (10) : 1499 - 1510
  • [28] FrepJoin: an efficient partition-based algorithm for edit similarity join
    Ji-zhou LUO
    Sheng-fei SHI
    Hong-zhi WANG
    Jian-zhong LI
    [J]. Frontiers of Information Technology & Electronic Engineering, 2017, 18 (10) : 1499 - 1510
  • [29] FrepJoin: an efficient partition-based algorithm for edit similarity join
    Ji-zhou Luo
    Sheng-fei Shi
    Hong-zhi Wang
    Jian-zhong Li
    [J]. Frontiers of Information Technology & Electronic Engineering, 2017, 18 : 1499 - 1510
  • [30] An Efficient Linear Space Algorithm for Consecutive Suffix Alignment under Edit Distance (Short Preliminary Paper)
    Hyyro, Heikki
    [J]. STRING PROCESSING AND INFORMATION RETRIEVAL, PROCEEDINGS, 2008, 5280 : 155 - 163