EFFICIENT STRING EDIT SIMILARITY JOIN ALGORITHM

Cited by: 1
Authors
Gouda, Karam [1]
Rashad, Metwally [2]
Affiliations
[1] Benha Univ, Fac Comp & Informat, Banha, Egypt
[2] Univ Pannonia, Fac Informat Technol, Veszprem, Hungary
Keywords
String data; edit distance; trie-based approaches; similarity join
DOI
10.4149/cai_2017_3_683
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
String similarity join is a basic and essential operation in many applications. In this paper, we investigate the problem of string similarity join with edit distance constraints. A trie-based edit similarity join framework has been proposed recently. The main advantage of existing trie-based algorithms is their support for similarity joins on short strings; their main weakness appears when joining long and distant strings. These methods generate and maintain many similar prefixes, called active nodes, which must then be removed in a subsequent pruning phase. With a large edit distance threshold, the number of active nodes becomes quite large. We propose a new trie-based join algorithm called PreJoin, which improves upon current trie-based join methods. It efficiently finds all similar string pairs using a novel active-node generation method, which minimizes the number of generated active nodes by applying the pruning heuristics early in the generation process. The performance of PreJoin is scaled in two ways. First, a dynamic reordering of the trie index accelerates the search for similar string pairs. Second, a partitioning of the string space improves performance on large edit distance thresholds. Experiments show that our approach is highly efficient for processing short as well as long strings, and outperforms the state-of-the-art trie-based join approaches by a factor of five.
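
To make the problem statement in the abstract concrete, the following is a minimal sketch of an edit similarity self-join over a trie. It is not PreJoin and does not implement active nodes, dynamic reordering, or partitioning; it uses the simpler, well-known technique of carrying one row of the edit-distance table down each trie path and abandoning a subtree once every entry in the row exceeds the threshold. All names here (`TrieNode`, `build_trie`, `trie_search`, `similarity_self_join`, `tau`) are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance (reference check)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


class TrieNode:
    def __init__(self):
        self.children = {}  # character -> TrieNode
        self.ids = []       # ids of the strings that end at this node


def build_trie(strings):
    """Index every string in a trie, storing its position as its id."""
    root = TrieNode()
    for sid, s in enumerate(strings):
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.ids.append(sid)
    return root


def trie_search(root, query, tau):
    """Return ids of indexed strings within edit distance tau of query."""
    hits = []
    # row[j] = edit distance between the node's prefix and query[:j]
    stack = [(root, list(range(len(query) + 1)))]
    while stack:
        node, row = stack.pop()
        if row[-1] <= tau:
            hits.extend(node.ids)
        for ch, child in node.children.items():
            new_row = [row[0] + 1]
            for j, qc in enumerate(query, 1):
                new_row.append(
                    min(row[j] + 1, new_row[j - 1] + 1, row[j - 1] + (qc != ch))
                )
            # The row minimum never decreases along a path, so a subtree
            # whose row is entirely above tau cannot contain a match.
            if min(new_row) <= tau:
                stack.append((child, new_row))
    return hits


def similarity_self_join(strings, tau):
    """Return sorted id pairs (i, j), i < j, with edit distance <= tau."""
    root = build_trie(strings)
    pairs = set()
    for sid, s in enumerate(strings):
        for other in trie_search(root, s, tau):
            if other != sid:
                pairs.add((min(sid, other), max(sid, other)))
    return sorted(pairs)
```

For example, `similarity_self_join(["kitten", "sitten", "sitting", "mitten", "banana"], 1)` returns `[(0, 1), (0, 3), (1, 3)]`. The sketch probes the trie once per string, so it does no sharing of work between probes; reducing that repeated work is the kind of cost the trie-based join methods described in the abstract aim to address.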
Pages: 683-704
Page count: 22