EFFICIENT STRING EDIT SIMILARITY JOIN ALGORITHM

Cited by: 1
Authors
Gouda, Karam [1]
Rashad, Metwally [2]
Affiliations
[1] Benha Univ, Fac Comp & Informat, Banha, Egypt
[2] Univ Pannonia, Fac Informat Technol, Veszprem, Hungary
Keywords
String data; edit distance; trie-based approaches; similarity join
DOI
10.4149/cai_2017_3_683
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
String similarity join is a basic and essential operation in many applications. In this paper, we investigate the problem of string similarity join with edit distance constraints. A trie-based edit similarity join framework has been proposed recently. The main advantage of existing trie-based algorithms is their support for similarity joins on short strings. Their main problem arises when joining long and distant strings. These methods generate and maintain many similar prefixes, called active nodes, which must then be removed in a subsequent pruning phase. With a large edit distance threshold, the number of active nodes becomes quite large. We propose a new trie-based join algorithm called PreJoin, which improves upon current trie-based join methods. It efficiently finds all similar string pairs using a novel active-node generation method that minimizes the number of generated active nodes by applying the pruning heuristics early in the generation process. The performance of PreJoin is further improved in two ways. First, a dynamic reordering of the trie index accelerates the search for similar string pairs. Second, a method for partitioning the string space improves performance at large edit distance thresholds. Experiments show that our approach is highly efficient for processing short as well as long strings, and outperforms state-of-the-art trie-based join approaches by a factor of five.
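
To make the problem statement concrete, the following is a minimal sketch of an edit similarity join: given a set of strings and a threshold, report every pair whose edit distance is at most that threshold. It is a naive quadratic baseline, not PreJoin and not a trie-based method; the function names `within_edit_distance` and `naive_similarity_join` and the parameter `tau` are assumptions introduced here for illustration only.

```python
from itertools import combinations


def within_edit_distance(a: str, b: str, tau: int) -> bool:
    """Return True if the Levenshtein distance between a and b is <= tau."""
    # Length filter: each edit changes the length by at most one,
    # so strings whose lengths differ by more than tau cannot match.
    if abs(len(a) - len(b)) > tau:
        return False

    # Standard dynamic program, keeping only the previous row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        # Early termination: row minima never decrease, so once every
        # entry exceeds tau the final distance must exceed tau as well.
        if min(curr) > tau:
            return False
        prev = curr
    return prev[-1] <= tau


def naive_similarity_join(strings, tau):
    """Self-join: all unordered pairs with edit distance <= tau (hypothetical helper)."""
    return [(s, t) for s, t in combinations(strings, 2)
            if within_edit_distance(s, t, tau)]


if __name__ == "__main__":
    words = ["kitten", "sitten", "sitting", "mitten", "banana"]
    for pair in naive_similarity_join(words, tau=1):
        print(pair)
```

The early-termination check is a crude analogue of pruning during verification. Trie-based methods such as those described in the abstract instead share computation across strings with common prefixes, which this baseline does not attempt.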
Pages: 683-704
Number of pages: 22