FrepJoin: an efficient partition-based algorithm for edit similarity join

Cited by: 0
Authors
Ji-zhou LUO
Sheng-fei SHI
Hong-zhi WANG
Jian-zhong LI
Affiliations
[1] School of Computer Science and Technology, Harbin Institute of Technology
[2] Guangdong Key Laboratory of Popular High Performance Computers, Key Laboratory of Service Computing and Application
Keywords
String similarity join; Edit distance; Filter and refine; Data partition; Combined frequency vectors;
DOI
Not available
Chinese Library Classification number
TP391.1 [Text information processing];
Discipline classification code
081203 ; 0835 ;
Abstract
String similarity join (SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. Existing algorithms usually adopt the filter-and-refine framework. They cannot capture the dissimilarity between subsets of strings, and they do not fully exploit statistics such as character frequencies. We develop a partition-based algorithm that uses such statistics. Frequency vectors are used to partition datasets into data chunks such that the dissimilarity between chunks is easily captured. A novel algorithm is designed to accelerate SSJ using the partitioned data. A new filter is proposed that leverages the statistics to avoid computing edit distances for a noticeable proportion of the candidate pairs that survive the existing filters. Our algorithm notably outperforms alternative methods on real datasets.
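To make the filter-and-refine idea concrete, the following is a minimal sketch of a generic character-frequency filter. It is not the FrepJoin algorithm or its combined frequency vectors, whose details the abstract does not give. It relies on a standard lower bound: each edit operation changes the surplus of characters on either side by at most one, so the larger of the two surpluses cannot exceed the edit distance. The function names `frequency_lower_bound`, `edit_distance`, and `similarity_join`, and the threshold name `tau`, are assumptions made for illustration.

```python
from collections import Counter


def frequency_lower_bound(s, t):
    """Lower bound on edit distance derived from character frequencies."""
    fs, ft = Counter(s), Counter(t)
    # Counter subtraction keeps only positive differences.
    surplus_s = sum((fs - ft).values())  # characters s has in excess of t
    surplus_t = sum((ft - fs).values())  # characters t has in excess of s
    return max(surplus_s, surplus_t)


def edit_distance(s, t):
    """Classic dynamic-programming Levenshtein distance (the refine step)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]


def similarity_join(strings, tau):
    """Return all pairs within edit distance tau (naive pairwise self-join)."""
    results = []
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            s, t = strings[i], strings[j]
            # Filter: skip the expensive verification when the bound exceeds tau.
            if frequency_lower_bound(s, t) > tau:
                continue
            # Refine: verify the surviving candidate pair exactly.
            if edit_distance(s, t) <= tau:
                results.append((s, t))
    return results
```

The sketch still enumerates every pair, so it only illustrates how frequency statistics can prune verification work. A partition-based method such as the one described above would additionally group strings so that whole chunks can be skipped at once.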
Pages: 1499 - 1510
Page count: 12
Related papers
  • [1] FrepJoin: an efficient partition-based algorithm for edit similarity join
    Ji-zhou Luo
    Sheng-fei Shi
    Hong-zhi Wang
    Jian-zhong Li
    [J]. Frontiers of Information Technology & Electronic Engineering, 2017, 18 : 1499 - 1510
  • [2] FrepJoin: an efficient partition-based algorithm for edit similarity join
    Luo, Ji-zhou
    Shi, Sheng-fei
    Wang, Hong-zhi
    Li, Jian-zhong
    [J]. FRONTIERS OF INFORMATION TECHNOLOGY & ELECTRONIC ENGINEERING, 2017, 18 (10) : 1499 - 1510
  • [3] EFFICIENT STRING EDIT SIMILARITY JOIN ALGORITHM
    Gouda, Karam
    Rashad, Metwally
    [J]. COMPUTING AND INFORMATICS, 2017, 36 (03) : 683 - 704
  • [4] Pass-Join: A Partition-based Method for Similarity Joins
    Li, Guoliang
    Deng, Dong
    Wang, Jiannan
    Feng, Jianhua
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2011, 5 (03) : 253 - 264
  • [5] SELECT-PARTITIONED JOIN - AN IMPROVED PARTITION-BASED JOIN ALGORITHM
    HO, C
    JONG, SP
    MYUNGHWAN, K
    [J]. INFORMATION SYSTEMS, 1991, 16 (02) : 199 - 209
  • [6] A Partition-Based Method for String Similarity Joins with Edit-Distance Constraints
    Li, Guoliang
    Deng, Dong
    Feng, Jianhua
    [J]. ACM TRANSACTIONS ON DATABASE SYSTEMS, 2013, 38 (02):
  • [7] Efficient structure similarity searches: a partition-based approach
    Xiang Zhao
    Chuan Xiao
    Xuemin Lin
    Wenjie Zhang
    Yang Wang
    [J]. The VLDB Journal, 2018, 27 : 53 - 78
  • [8] Efficient structure similarity searches: a partition-based approach
    Zhao, Xiang
    Xiao, Chuan
    Lin, Xuemin
    Zhang, Wenjie
    Wang, Yang
    [J]. VLDB JOURNAL, 2018, 27 (01) : 53 - 78
  • [9] An efficient partition-based parallel PageRank algorithm
    Manaskasemsak, B
    Rungsawang, A
    [J]. 11th International Conference on Parallel and Distributed Systems, Vol I, Proceedings, 2005 : 257 - 263
  • [10] Ed-Join: An Efficient Algorithm for Similarity Joins With Edit Distance Constraints
    Xiao, Chuan
    Wang, Wei
    Lin, Xuemin
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2008, 1 (01) : 933 - 944