A unified framework for string similarity search with edit-distance constraint

Cited by: 0
Authors
Minghe Yu
Jin Wang
Guoliang Li
Yong Zhang
Dong Deng
Jianhua Feng
Affiliations
[1] Department of Computer Science and Technology, Tsinghua University
Source
The VLDB Journal | 2017 / Vol. 26, No. 2
Keywords
Similarity search; Edit distance; Top-k; Disk-based method; Partition
DOI
Not available
Abstract
String similarity search is a fundamental operation in data cleaning and integration. It has two variants: threshold-based string similarity search and top-k string similarity search. Existing algorithms are efficient for either the former or the latter, but most cannot support both variants. To address this limitation, we propose a unified framework. We first recursively partition strings into disjoint segments and build a hierarchical segment tree index (HS-Tree) on top of the segments. We then use the HS-Tree to support similarity search. For threshold-based search, we identify appropriate tree nodes based on the threshold to answer the query and devise an efficient algorithm (HS-Search). For top-k search, we identify promising strings that are likely to be similar to the query, use these strings to estimate an upper bound for pruning dissimilar strings, and propose an algorithm (HS-Topk). We develop effective pruning techniques to further improve performance. To support large data sets, we extend our techniques to the disk-based setting. Experimental results on real-world data sets show that our method achieves high performance on both problems and outperforms state-of-the-art algorithms by 5–10 times.
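The two search variants named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's HS-Tree, HS-Search, or HS-Topk. It assumes a flat inverted index built for a single fixed threshold, and it relies on the standard pigeonhole argument: if a string is split into tau + 1 disjoint segments, any query within edit distance tau must contain at least one segment unchanged. The top-k part uses only a length-difference lower bound together with the current k-th best distance as the pruning bound. All function names (`edit_distance`, `segments`, `build_index`, `threshold_search`, `topk_search`) are hypothetical and chosen for illustration.

```python
import heapq
from collections import defaultdict


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def segments(s, n):
    """Split s into n disjoint, near-equal segments (longer ones last)."""
    base, extra = divmod(len(s), n)
    out, pos = [], 0
    for i in range(n):
        size = base + (1 if i >= n - extra else 0)
        out.append(s[pos:pos + size])
        pos += size
    return out


def build_index(strings, tau):
    """Inverted index: (string length, segment slot, segment text) -> ids.

    Assumption: one flat index per threshold tau, unlike a hierarchical index.
    """
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for slot, seg in enumerate(segments(s, tau + 1)):
            index[(len(s), slot, seg)].append(sid)
    return index


def threshold_search(strings, index, query, tau):
    """Return sorted (distance, string) pairs with edit distance <= tau."""
    candidates = set()
    # Length filter: only lengths within tau of the query can qualify.
    for length in range(max(0, len(query) - tau), len(query) + tau + 1):
        # Segment sizes depend only on the indexed string length.
        for slot, seg in enumerate(segments("x" * length, tau + 1)):
            size = len(seg)
            for start in range(len(query) - size + 1):
                key = (length, slot, query[start:start + size])
                candidates.update(index.get(key, ()))
    results = []
    for sid in candidates:  # verification step removes false positives
        d = edit_distance(strings[sid], query)
        if d <= tau:
            results.append((d, strings[sid]))
    return sorted(results)


def topk_search(strings, query, k):
    """Return the k closest strings as sorted (distance, string) pairs."""
    if k <= 0:
        return []
    heap = []  # max-heap via negated distance; holds the best k seen so far
    # Visit strings of similar length first as the "promising" candidates.
    for s in sorted(strings, key=lambda t: abs(len(t) - len(query))):
        bound = -heap[0][0] if len(heap) == k else float("inf")
        if abs(len(s) - len(query)) >= bound:
            break  # length difference lower-bounds edit distance
        d = edit_distance(s, query)
        if len(heap) < k:
            heapq.heappush(heap, (-d, s))
        elif d < bound:
            heapq.heapreplace(heap, (-d, s))
    return sorted((-nd, s) for nd, s in heap)
```

A call such as `threshold_search(strings, build_index(strings, 1), "vldb", 1)` returns every indexed string within one edit of the query, and `topk_search(strings, "vldb", 2)` returns the two closest strings. In this sketch the index must be rebuilt for each threshold; according to the abstract, the HS-Tree avoids that by selecting appropriate tree nodes based on the threshold.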
Pages: 249-274
Page count: 25