Extending String Similarity Join to Tolerant Fuzzy Token Matching

Cited by: 36
Authors
Wang, Jiannan [1]
Li, Guoliang [1]
Feng, Jianhua [1]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci & Technol, Tsinghua Natl Lab Informat Sci & Technol TNList, Beijing 100084, Peoples R China
Source
ACM Transactions on Database Systems | 2014 / Vol. 39 / No. 1
Funding
National Natural Science Foundation of China
Keywords
Algorithms; Performance; Experiment; String similarity join; similarity function; signature scheme; fuzzy token matching-based similarity; weighted tokens
DOI
10.1145/2535628
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
String similarity join, which finds similar string pairs between two string sets, is an essential operation in many applications and has recently attracted significant attention in the database community. A significant challenge in similarity join is to implement an effective fuzzy match operation that finds all similar string pairs, including those that do not match exactly. In this article, we propose a new similarity function, called fuzzy-token-matching-based similarity, which extends token-based similarity functions (e.g., Jaccard similarity and cosine similarity) by allowing fuzzy matches between two tokens. We study the problem of similarity join using this new similarity function and present a signature-based method to address it. We propose new signature schemes and develop effective pruning techniques to improve performance. We also extend our techniques to support weighted tokens. Experimental results show that our method achieves high efficiency and result quality and significantly outperforms state-of-the-art approaches.
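To make the idea of allowing fuzzy matches between tokens concrete, the following is a minimal Python sketch of a fuzzy-token Jaccard score. It is an illustration under stated assumptions, not the article's method: the token similarity (`difflib.SequenceMatcher.ratio`), the threshold name `delta`, the greedy one-to-one pairing of tokens, and the function names are all choices made here for illustration. The abstract does not specify how tokens are compared or matched.

```python
from difflib import SequenceMatcher


def token_sim(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (assumed measure, not from the article)."""
    return SequenceMatcher(None, a, b).ratio()


def fuzzy_overlap(tokens1, tokens2, delta=0.8):
    """Sum of similarities over a greedy one-to-one token pairing.

    Only pairs with similarity >= delta may be paired. Greedy pairing is a
    simplification; it is not guaranteed to find the best possible pairing.
    """
    pairs = sorted(
        ((token_sim(a, b), i, j)
         for i, a in enumerate(tokens1)
         for j, b in enumerate(tokens2)),
        reverse=True,
    )
    used1, used2 = set(), set()
    total = 0.0
    for sim, i, j in pairs:
        if sim < delta:
            break  # all remaining pairs are below the threshold
        if i in used1 or j in used2:
            continue  # each token is paired at most once
        used1.add(i)
        used2.add(j)
        total += sim
    return total


def fuzzy_jaccard(s1: str, s2: str, delta=0.8) -> float:
    """Jaccard-style score where the exact overlap is replaced by the fuzzy overlap."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    if not t1 and not t2:
        return 1.0
    overlap = fuzzy_overlap(t1, t2, delta)
    return overlap / (len(t1) + len(t2) - overlap)


if __name__ == "__main__":
    # With delta=1.0 only identical tokens pair up, so the misspelled token is ignored.
    print(fuzzy_jaccard("nba mcgrady", "macgrady nba", delta=1.0))
    # With a lower delta the near-identical tokens also contribute.
    print(fuzzy_jaccard("nba mcgrady", "macgrady nba", delta=0.8))
```

Setting `delta` to 1.0 reduces the score to exact token matching over token lists, which is how the sketch relates back to ordinary token-based Jaccard similarity.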
Pages: 45