MinJoin++: a fast algorithm for string similarity joins under edit distance

Cited by: 0
Authors
Nikolai Karpov
Haoyu Zhang
Qin Zhang
Affiliations
[1] Indiana University
[2] Meta Inc.
Source
The VLDB Journal | 2024 / Vol. 33
Keywords
String similarity joins; Edit distance; Local hash minima
DOI
Not available
Abstract
We study the problem of computing similarity joins under edit distance on a set of strings. The edit similarity join is a fundamental problem in databases, data mining, and bioinformatics. It finds many applications in data cleaning and integration, collaborative filtering, genome sequence assembly, etc. This problem has attracted a lot of attention in the past two decades. However, all previous algorithms either cannot scale to long strings and large similarity thresholds, or suffer from imperfect accuracy. In this paper, we propose a new algorithm for edit similarity joins using a novel string partition-based approach. We show that, theoretically, our algorithm finds all similar pairs with high probability and runs in linear time (plus a data-dependent verification step). The algorithm can also be easily parallelized. Experiments on real-world datasets show that our algorithm outperforms the state-of-the-art algorithms for edit similarity joins by orders of magnitude in running time and achieves perfect accuracy on most datasets that we have tested.
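To make the problem statement concrete, the following is a minimal sketch of the naive baseline: it compares every pair of strings and reports those whose edit distance is at most a threshold k. This is not the MinJoin++ algorithm, whose partition-based method is not detailed in this record; the function names `edit_distance_within` and `naive_edit_similarity_join` are hypothetical and chosen for illustration only.

```python
from itertools import combinations


def edit_distance_within(a: str, b: str, k: int) -> bool:
    """Return True if the edit distance between a and b is at most k."""
    # Length filter: each insertion or deletion changes the length by one.
    if abs(len(a) - len(b)) > k:
        return False
    # Standard dynamic program, kept to two rows of the table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            cur[j] = min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute or match
            )
        # The final distance cannot be smaller than the minimum of any row.
        if min(cur) > k:
            return False
        prev = cur
    return prev[len(b)] <= k


def naive_edit_similarity_join(strings, k):
    """Return all index pairs (i, j), i < j, with edit distance at most k."""
    return [
        (i, j)
        for (i, s), (j, t) in combinations(enumerate(strings), 2)
        if edit_distance_within(s, t, k)
    ]


if __name__ == "__main__":
    words = ["kitten", "sitten", "sitting", "mitten", "banana"]
    print(naive_edit_similarity_join(words, 1))
```

This baseline performs a quadratic number of comparisons, each itself quadratic in string length, which illustrates why the abstract stresses scaling to long strings and large thresholds.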
Pages: 281 - 299
Page count: 18
Related papers
39 in total
  • [1] MinJoin++: a fast algorithm for string similarity joins under edit distance
    Karpov, Nikolai
    Zhang, Haoyu
    Zhang, Qin
    VLDB JOURNAL, 2024, 33 (02): 281 - 299
  • [2] Bounded Occurrence Edit Distance: A New Metric for String Similarity Joins with Edit Distance Constraints
    Komatsu, Tomoki
    Okuta, Ryosuke
    Narisawa, Kazuyuki
    Shinohara, Ayumi
    SOFSEM 2014: THEORY AND PRACTICE OF COMPUTER SCIENCE, 2014, 8327: 363 - 374
  • [3] MinJoin: Efficient Edit Similarity Joins via Local Hash Minima
    Zhang, Haoyu
    Zhang, Qin
    KDD'19: PROCEEDINGS OF THE 25TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2019: 1093 - 1103
  • [4] A Partition-Based Method for String Similarity Joins with Edit-Distance Constraints
    Li, Guoliang
    Deng, Dong
    Feng, Jianhua
    ACM TRANSACTIONS ON DATABASE SYSTEMS, 2013, 38 (02)
  • [5] Ed-Join: An Efficient Algorithm for Similarity Joins With Edit Distance Constraints
    Xiao, Chuan
    Wang, Wei
    Lin, Xuemin
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2008, 1 (01): 933 - 944
  • [6] VChunkJoin: An Efficient Algorithm for Edit Similarity Joins
    Wang, Wei
    Qin, Jianbin
    Xiao, Chuan
    Lin, Xuemin
    Shen, Heng Tao
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2013, 25 (08): 1916 - 1929
  • [7] Efficient Graph Similarity Joins with Edit Distance Constraints
    Zhao, Xiang
    Xiao, Chuan
    Lin, Xuemin
    Wang, Wei
    2012 IEEE 28TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2012: 834 - 845
  • [8] MinSearch: An Efficient Algorithm for Similarity Search under Edit Distance
    Zhang, Haoyu
    Zhang, Qin
    KDD '20: PROCEEDINGS OF THE 26TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2020: 566 - 576
  • [9] A Novel Similarity Verification Algorithm Under Edit Distance Limitation
    Yu, C.-Y.
    Li, M.-M.
    Zhao, C.
    Ma, H.-T.
    Dongbei Daxue Xuebao/Journal of Northeastern University, 2019, 40 (11): 1543 - 1548
  • [10] Efficient String Edit Similarity Join Algorithm
    Gouda, Karam
    Rashad, Metwally
    COMPUTING AND INFORMATICS, 2017, 36 (03): 683 - 704