EFFICIENT STRING EDIT SIMILARITY JOIN ALGORITHM

Cited by: 1
Authors
Gouda, Karam [1]
Rashad, Metwally [2]
Affiliations
[1] Benha Univ, Fac Comp & Informat, Banha, Egypt
[2] Univ Pannonia, Fac Informat Technol, Veszprem, Hungary
Keywords
String data; edit distance; trie-based approaches; similarity join
DOI
10.4149/cai_2017_3_683
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
String similarity join is a basic and essential operation in many applications. In this paper, we investigate the problem of string similarity join with edit distance constraints. A trie-based edit similarity join framework has been proposed recently. The main advantage of existing trie-based algorithms is their support for similarity join on short strings. Their main weakness appears when joining long and distant strings: these methods generate and maintain many similar prefixes, called active nodes, which must then be removed in a subsequent pruning phase. With a large edit distance threshold, the number of active nodes becomes quite large. In this paper, we propose a new trie-based join algorithm called PreJoin, which improves upon current trie-based join methods. It efficiently finds all similar string pairs using a novel active-node generation method, which minimizes the number of generated active nodes by applying the pruning heuristics early in the generation process. The performance of PreJoin is further improved in two ways. First, a dynamic reordering of the trie index is used to accelerate the search for similar string pairs. Second, a partitioning of the string space is used to improve performance on large edit distance thresholds. Experiments show that our approach is highly efficient for processing short as well as long strings, and outperforms the state-of-the-art trie-based join approaches by a factor of five.
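
To make the problem setting in the abstract concrete, the following is a minimal sketch of a generic trie-based edit similarity self-join. It is not PreJoin and does not reproduce the paper's active-node generation, trie reordering, or partitioning; it only illustrates the general idea of walking a trie while carrying one dynamic-programming row per node and cutting off subtrees that can no longer fall within the threshold. All names (`TrieNode`, `build_trie`, `search`, `similarity_self_join`, `tau`) are hypothetical and chosen for illustration.

```python
from typing import Dict, List, Set, Tuple


class TrieNode:
    """One trie node; `ids` holds the ids of strings that end here."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.ids: List[int] = []


def build_trie(strings: List[str]) -> TrieNode:
    """Index every string by its position in the input list."""
    root = TrieNode()
    for sid, s in enumerate(strings):
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.ids.append(sid)
    return root


def search(root: TrieNode, query: str, tau: int) -> List[int]:
    """Return ids of indexed strings within edit distance tau of query."""
    results: List[int] = []
    n = len(query)
    # Row for the empty prefix: distance to query[:j] is j insertions.
    stack = [(root, list(range(n + 1)))]
    while stack:
        node, row = stack.pop()
        # row[-1] is the distance between this node's prefix and the query.
        if row[-1] <= tau:
            results.extend(node.ids)
        for ch, child in node.children.items():
            new_row = [row[0] + 1]
            for j in range(1, n + 1):
                cost = 0 if query[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1,   # insertion
                                   row[j] + 1,           # deletion
                                   row[j - 1] + cost))   # match/substitution
            # Row minima never decrease deeper in the trie, so a subtree
            # whose row minimum exceeds tau cannot contain any answer.
            if min(new_row) <= tau:
                stack.append((child, new_row))
    return results


def similarity_self_join(strings: List[str], tau: int) -> Set[Tuple[int, int]]:
    """Return all id pairs (i, j), i < j, with edit distance <= tau."""
    root = build_trie(strings)
    pairs: Set[Tuple[int, int]] = set()
    for i, s in enumerate(strings):
        for j in search(root, s, tau):
            if i < j:
                pairs.add((i, j))
    return pairs


if __name__ == "__main__":
    data = ["kitten", "sitten", "sitting", "mitten", "banana"]
    print(sorted(similarity_self_join(data, 1)))
```

This baseline probes the trie once per string, so it recomputes work that a join-oriented method would share across strings with common prefixes; it is meant only as a reference point for what "pruning during the trie traversal" looks like.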
Pages: 683-704
Number of pages: 22
Related papers
50 in total
  • [41] An algorithm for string edit distance allowing substring reversals
    Arslan, Abdullah N.
    [J]. BIBE 2006: SIXTH IEEE SYMPOSIUM ON BIOINFORMATICS AND BIOENGINEERING, PROCEEDINGS, 2006, : 220 - +
  • [42] minIL: A Simple and Small Index for String Similarity Search with Edit Distance
    Yang, Zhong
    Zheng, Bolong
    Wang, Xianzhi
    Li, Guohui
    Zhou, Xiaofang
    [J]. 2022 IEEE 38TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2022), 2022, : 565 - 577
  • [43] A unified framework for string similarity search with edit-distance constraint
    Yu, Minghe
    Wang, Jin
    Li, Guoliang
    Zhang, Yong
    Deng, Dong
    Feng, Jianhua
    [J]. VLDB JOURNAL, 2017, 26 (02): : 249 - 274
  • [44] A unified framework for string similarity search with edit-distance constraint
    Minghe Yu
    Jin Wang
    Guoliang Li
    Yong Zhang
    Dong Deng
    Jianhua Feng
    [J]. The VLDB Journal, 2017, 26 : 249 - 274
  • [45] Extending String Similarity Join to Tolerant Fuzzy Token Matching
    Wang, Jiannan
    Li, Guoliang
    Feng, Jianhua
    [J]. ACM TRANSACTIONS ON DATABASE SYSTEMS, 2014, 39 (01):
  • [46] Efficient Graph Similarity Joins with Edit Distance Constraints
    Zhao, Xiang
    Xiao, Chuan
    Lin, Xuemin
    Wang, Wei
    [J]. 2012 IEEE 28TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2012, : 834 - 845
  • [47] EmbedJoin: Efficient Edit Similarity Joins via Embeddings
    Zhang, Haoyu
    Zhang, Qin
    [J]. KDD'17: PROCEEDINGS OF THE 23RD ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2017, : 585 - 594
  • [48] Efficient Privacy Preserving Protocols for Similarity Join
    Hawashin, Bilal
    Fotouhi, Farshad
    Truta, Traian Marius
    Grosky, William
    [J]. TRANSACTIONS ON DATA PRIVACY, 2012, 5 (01) : 297 - 330
  • [49] Efficient subgraph join based on connectivity similarity
    Wang, Yue
    Wang, Hongzhi
    Li, Jianzhong
    Gao, Hong
    [J]. WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2015, 18 (04): : 871 - 887
  • [50] Efficient SimRank-Based Similarity Join
    Zheng, Weiguo
    Zou, Lei
    Chen, Lei
    Zhao, Dongyan
    [J]. ACM TRANSACTIONS ON DATABASE SYSTEMS, 2017, 42 (03):