MinJoin++: a fast algorithm for string similarity joins under edit distance

Cited by: 0
Authors
Nikolai Karpov
Haoyu Zhang
Qin Zhang
Affiliations
[1] Indiana University
[2] Meta Inc.
Source
The VLDB Journal | 2024 / Vol. 33
Keywords
String similarity joins; Edit distance; Local hash minima
DOI
Not available
Abstract
We study the problem of computing similarity joins under edit distance on a set of strings. The edit similarity join is a fundamental problem in databases, data mining, and bioinformatics, with many applications in data cleaning and integration, collaborative filtering, genome sequence assembly, etc. This problem has attracted a lot of attention in the past two decades. However, all previous algorithms either cannot scale to long strings and large similarity thresholds, or suffer from imperfect accuracy. In this paper, we propose a new algorithm for edit similarity joins using a novel string partition-based approach. We show that, theoretically, our algorithm finds all similar pairs with high probability and runs in linear time (plus a data-dependent verification step). The algorithm can also be easily parallelized. Experiments on real-world datasets show that our algorithm outperforms the state-of-the-art algorithms for edit similarity joins by orders of magnitude in running time and achieves perfect accuracy on most datasets that we have tested.
Pages: 281 - 299
Number of pages: 18
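
To make the problem statement in the abstract concrete, here is a minimal sketch of a brute-force edit similarity join: given a set of strings and a threshold k, report every pair whose edit distance is at most k. This is not the MinJoin++ algorithm and does not use the partition-based approach the paper proposes; it is only the quadratic baseline that such algorithms aim to avoid. The function names `edit_distance_within` and `naive_edit_join` are hypothetical and chosen for illustration.

```python
from itertools import combinations


def edit_distance_within(a: str, b: str, k: int) -> bool:
    """Return True if the Levenshtein distance between a and b is at most k."""
    # Length filter: each edit changes the length by at most one.
    if abs(len(a) - len(b)) > k:
        return False
    # Standard dynamic program, keeping only the previous row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        # Row minima never decrease, so the pair can be rejected early.
        if min(cur) > k:
            return False
        prev = cur
    return prev[-1] <= k


def naive_edit_join(strings, k):
    """Brute-force join: index pairs (i, j), i < j, with edit distance <= k."""
    return [(i, j)
            for (i, s), (j, t) in combinations(enumerate(strings), 2)
            if edit_distance_within(s, t, k)]


if __name__ == "__main__":
    data = ["kitten", "sitten", "sitting", "banana"]
    print(naive_edit_join(data, 1))
    print(naive_edit_join(data, 2))
```

The sketch compares all pairs, so its cost grows quadratically with the number of strings, in addition to the per-pair verification cost. The abstract's claim of linear time plus a data-dependent verification step is a claim about avoiding that all-pairs comparison.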