Parallelizing String Similarity Join Algorithms

Cited by: 0
Authors
Yao, Ling-Chih [1]
Lim, Lipyeow [1]
Affiliation
[1] Univ Hawaii Manoa, Honolulu, HI 96822 USA
Source
Keywords
PARTITION-BASED METHOD
DOI
10.1007/978-3-319-92013-9_27
Chinese Library Classification (CLC) number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
A key operation in data cleaning and integration is the use of string similarity join (SSJ) algorithms to identify and remove duplicates or similar records within data sets. With the advent of big data, a natural question is how to parallelize SSJ algorithms. There is a large body of existing work on SSJ algorithms and parallelizing each one of them may not be the most feasible solution. In this paper, we propose a parallelization framework for string similarity joins that utilizes existing SSJ algorithms. Our framework partitions the data using a variety of partitioning strategies and then executes the SSJ algorithms on the partitions in parallel. Some of the partitioning strategies that we investigate trade accuracy for speed. We implemented and validated our framework on several SSJ algorithms and data sets. Our experiments show that our framework results in significant speedup with little loss in accuracy.
Pages: 322-327
Page count: 6
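
The abstract describes a partition-then-join framework: split the data, run an existing SSJ algorithm on each partition in parallel, and merge the results. The following is a minimal Python sketch of that idea, not the authors' implementation. The names `naive_ssj`, `partition_by_first_char`, and `parallel_ssj` are hypothetical, the edit-distance threshold join is only a stand-in for "any existing SSJ algorithm", and grouping by first character is an assumed example of a lossy partitioning strategy that trades accuracy for speed.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def naive_ssj(strings, threshold):
    """Stand-in for an existing SSJ algorithm (assumption): return every
    pair of distinct strings whose edit distance is at most `threshold`."""
    return {(a, b) for a, b in combinations(sorted(set(strings)), 2)
            if edit_distance(a, b) <= threshold}


def partition_by_first_char(strings):
    """Hypothetical lossy partitioning: group strings by first character.
    Similar strings that differ in their first character are never compared."""
    parts = {}
    for s in strings:
        parts.setdefault(s[:1], []).append(s)
    return list(parts.values())


def parallel_ssj(strings, threshold, partitioner=partition_by_first_char,
                 ssj=naive_ssj, workers=4):
    """Run `ssj` independently on each partition and union the results."""
    partitions = partitioner(strings)
    # Threads keep the sketch simple; for CPU-bound joins in CPython a
    # process pool would be needed to get real parallel speedup.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda part: ssj(part, threshold), partitions))
    return set().union(*results)


data = ["kitten", "sitten", "kitchen", "mitten", "apple", "apply"]
exact = naive_ssj(data, threshold=1)      # join over the whole data set
approx = parallel_ssj(data, threshold=1)  # join per partition, then merge
recall = len(approx) / len(exact)         # fraction of true pairs recovered
```

With this partitioner, `approx` is always a subset of `exact`, because pairs that straddle two partitions are missed; `recall` measures that loss. A partitioning strategy that replicates strings across partitions could recover the missed pairs at the cost of extra work.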