Efficiently Supporting Edit Distance Based String Similarity Search Using B+-Trees

Cited by: 24
Authors
Lu, Wei [1]
Du, Xiaoyong [2, 3]
Hadjieleftheriou, Marios [4]
Ooi, Beng Chin [1]
Affiliations
[1] Natl Univ Singapore, Sch Comp, Singapore 117548, Singapore
[2] Minist Educ, Key Lab Data Engn & Knowledge Engn, Beijing, Peoples R China
[3] Renmin Univ China, Sch Informat, Beijing, Peoples R China
[4] AT&T Labs Res, Florham Pk, NJ 07932 USA
Funding
National Science Foundation (USA);
Keywords
Similarity search; string; edit distance; B+-tree;
DOI
10.1109/TKDE.2014.2309131
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Edit distance is widely used for measuring the similarity between two strings. As a primitive operation, edit distance based string similarity search is to find strings in a collection that are similar to a given query string using edit distance. Existing approaches for answering such string similarity queries follow the filter-and-verify framework by using various indexes. Typically, most approaches assume that indexes and data sets are maintained in main memory. To overcome this limitation, in this paper, we propose B+-tree based approaches to answer edit distance based string similarity queries, and hence, our approaches can be easily integrated into existing RDBMSs. In general, we answer string similarity search using pruning techniques employed in the metric space in that edit distance is a metric. First, we split the string collection into partitions according to a set of reference strings. Then, we index strings in all partitions using a single B+-tree based on the distances of these strings to their corresponding reference strings. Finally, we propose two approaches to efficiently answer range and KNN queries, respectively, based on the B+-tree. We prove that the optimal partitioning of the data set is an NP-hard problem, and therefore propose a heuristic approach for selecting the reference strings greedily and present an optimal partition assignment strategy to minimize the expected number of strings that need to be verified during the query evaluation. Through extensive experiments over a variety of real data sets, we demonstrate that our B+-tree based approaches provide superior performance over state-of-the-art techniques on both range and KNN queries in most cases.
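The pipeline outlined in the abstract (partition by reference strings, index by distance to the reference, prune with the metric property) can be illustrated with a minimal Python sketch. It is not the authors' implementation. A sorted list searched with `bisect` stands in for the B+-tree, each string is assigned to its nearest reference string (the paper proposes its own selection and assignment strategies, which are not reproduced here), and the names `build_index` and `range_query` are hypothetical. Pruning relies only on the triangle inequality: if d(q, s) <= tau, then d(s, r) must lie in [d(q, r) - tau, d(q, r) + tau].

```python
from bisect import bisect_left


def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def build_index(strings, references):
    """Return a sorted list of (partition_id, distance_to_reference, string).

    Assumption: each string goes to its nearest reference string. The sorted
    list plays the role of a single B+-tree keyed on (partition, distance).
    """
    entries = []
    for s in strings:
        dists = [edit_distance(s, r) for r in references]
        pid = min(range(len(references)), key=dists.__getitem__)
        entries.append((pid, dists[pid], s))
    entries.sort()
    return entries


def range_query(index, references, query, tau):
    """Find all indexed strings within edit distance tau of query."""
    results = []
    for pid, ref in enumerate(references):
        dq = edit_distance(query, ref)
        # Triangle inequality: only keys in [dq - tau, dq + tau] can qualify.
        lo = bisect_left(index, (pid, max(0, dq - tau), ""))
        hi = bisect_left(index, (pid, dq + tau + 1, ""))
        for _, _, s in index[lo:hi]:
            if edit_distance(query, s) <= tau:   # verification step
                results.append(s)
    return sorted(results)


words = ["kitten", "sitting", "mitten", "bitten", "written",
         "smitten", "apple", "apply", "ample"]
refs = ["kitten", "apple"]          # hypothetical reference strings
idx = build_index(words, refs)
matches = range_query(idx, refs, "fitten", 1)
```

In a relational database the same idea would map each tuple to a composite key on (partition, distance) in an ordinary B+-tree index, so each partition contributes one range scan followed by verification of the surviving candidates.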
Pages: 2983-2996
Page count: 14
Related papers
50 in total
  • [1] minIL: A Simple and Small Index for String Similarity Search with Edit Distance
    Yang, Zhong
    Zheng, Bolong
    Wang, Xianzhi
    Li, Guohui
    Zhou, Xiaofang
    [J]. 2022 IEEE 38TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2022), 2022, : 565 - 577
  • [2] A unified framework for string similarity search with edit-distance constraint
    Yu, Minghe
    Wang, Jin
    Li, Guoliang
    Zhang, Yong
    Deng, Dong
    Feng, Jianhua
    [J]. VLDB JOURNAL, 2017, 26 (02): : 249 - 274
  • [3] A unified framework for string similarity search with edit-distance constraint
    Minghe Yu
    Jin Wang
    Guoliang Li
    Yong Zhang
    Dong Deng
    Jianhua Feng
    [J]. The VLDB Journal, 2017, 26 : 249 - 274
  • [4] Top-k String Similarity Search with Edit-Distance Constraints
    Deng, Dong
    Li, Guoliang
    Feng, Jianhua
    Li, Wen-Syan
    [J]. 2013 IEEE 29TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2013, : 925 - 936
  • [5] Edit Distance Based Similarity Search of Heterogeneous Information Networks
    Lu, Jianhua
    Lu, Ningyun
    Ma, Sipei
    Zhang, Baili
    [J]. DATABASE AND EXPERT SYSTEMS APPLICATIONS (DEXA 2018), PT II, 2018, 11030 : 195 - 202
  • [6] Fast Similarity Search for Graphs by Edit Distance
    Rachkovskij, D. A.
    [J]. CYBERNETICS AND SYSTEMS ANALYSIS, 2019, 55 (06) : 1039 - 1051
  • [7] Bounded Occurrence Edit Distance: A New Metric for String Similarity Joins with Edit Distance Constraints
    Komatsu, Tomoki
    Okuta, Ryosuke
    Narisawa, Kazuyuki
    Shinohara, Ayumi
    [J]. SOFSEM 2014: THEORY AND PRACTICE OF COMPUTER SCIENCE, 2014, 8327 : 363 - 374
  • [8] Fast Similarity Search for Graphs by Edit Distance
    D. A. Rachkovskij
    [J]. Cybernetics and Systems Analysis, 2019, 55 : 1039 - 1051
  • [9] Compressed String Dictionary Search with Edit Distance One
    Belazzougui, Djamal
    Venturini, Rossano
    [J]. ALGORITHMICA, 2016, 74 (03) : 1099 - 1122
  • [10] Compressed String Dictionary Search with Edit Distance One
    Djamal Belazzougui
    Rossano Venturini
    [J]. Algorithmica, 2016, 74 : 1099 - 1122