Extending String Similarity Join to Tolerant Fuzzy Token Matching

Cited by: 36
Authors
Wang, Jiannan [1]
Li, Guoliang [1]
Feng, Jianhua [1]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci & Technol, Tsinghua Natl Lab Informat Sci & Technol TNList, Beijing 100084, Peoples R China
Source
ACM TRANSACTIONS ON DATABASE SYSTEMS | 2014 / Vol. 39 / No. 01
Funding
National Natural Science Foundation of China
Keywords
Algorithms; Performance; Experiment; String similarity join; similarity function; signature scheme; fuzzy token matching-based similarity; weighted tokens
DOI
10.1145/2535628
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
String similarity join, which finds similar string pairs between two string sets, is an essential operation in many applications and has attracted significant attention recently in the database community. A significant challenge in similarity join is to implement an effective fuzzy match operation to find all similar string pairs that may not match exactly. In this article, we propose a new similarity function, called fuzzy-token-matching-based similarity, which extends token-based similarity functions (e.g., Jaccard similarity and cosine similarity) by allowing fuzzy matches between two tokens. We study the problem of similarity join using this new similarity function and present a signature-based method to address this problem. We propose new signature schemes and develop effective pruning techniques to improve the performance. We also extend our techniques to support weighted tokens. Experimental results show that our method achieves high efficiency and result quality and significantly outperforms state-of-the-art approaches.
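As an illustration of the idea in the abstract, the sketch below shows one plausible way to relax Jaccard similarity so that near-identical tokens count as partial matches. The abstract does not specify the token-level measure or how tokens are paired, so everything here is an assumption: normalized edit similarity between tokens, a threshold `delta` below which a token pair does not count, and a maximum-weight one-to-one matching between the two token sets. The names `edit_similarity`, `fuzzy_overlap`, and `fuzzy_jaccard` are hypothetical. The matching is solved by exhaustive bitmask search, which is only practical for short strings; it is not the signature-based join method the article proposes.

```python
from functools import lru_cache


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Normalized edit similarity in [0, 1]; 1.0 means identical tokens."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def fuzzy_overlap(tokens_a, tokens_b, delta: float) -> float:
    """Weight of a maximum-weight one-to-one matching between two token lists.

    Assumption: token pairs with similarity below delta contribute nothing.
    Exhaustive search over bitmasks, so only suitable for small inputs.
    """
    a, b = list(tokens_a), list(tokens_b)
    weights = [[edit_similarity(x, y) for y in b] for x in a]
    weights = [[w if w >= delta else 0.0 for w in row] for row in weights]

    @lru_cache(maxsize=None)
    def best(i: int, used: int) -> float:
        if i == len(a):
            return 0.0
        score = best(i + 1, used)  # option: leave a[i] unmatched
        for j in range(len(b)):
            if not used & (1 << j) and weights[i][j] > 0.0:
                score = max(score, weights[i][j] + best(i + 1, used | (1 << j)))
        return score

    return best(0, 0)


def fuzzy_jaccard(s1: str, s2: str, delta: float = 0.8) -> float:
    """Jaccard-style ratio with the exact intersection replaced by fuzzy overlap."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    if not t1 and not t2:
        return 1.0
    overlap = fuzzy_overlap(sorted(t1), sorted(t2), delta)
    return overlap / (len(t1) + len(t2) - overlap)


if __name__ == "__main__":
    print(fuzzy_jaccard("nba mcgrady", "macgrady nba", delta=0.8))
```

Under these assumptions, "nba mcgrady" and "macgrady nba" score about 0.88, whereas exact-token Jaccard gives 1/3 because "mcgrady" and "macgrady" are treated as different tokens. Raising `delta` to 0.9 rejects that token pair and the score falls back to 1/3.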
Pages: 45
Related papers
50 in total
  • [41] Entity Matching with String Transformation and Similarity-Based Features
    Sakai, Kazunori
    Dong, Yuyang
    Oyamada, Masafumi
    Takeoka, Kunihiro
    Okadome, Takeshi
    [J]. SOFTWARE FOUNDATIONS FOR DATA INTEROPERABILITY, SFDI 2021, 2022, 1457 : 76 - 87
  • [42] Supporting similarity operations based on approximate string matching on the web
    Schallehn, E
    Geist, I
    Sattler, KU
    [J]. ON THE MOVE TO MEANINGFUL INTERNET SYSTEMS 2004: COOPIS, DOA, AND ODBASE, PT 1, PROCEEDINGS, 2004, 3290 : 227 - 244
  • [43] Similarity Detection Method Based on Assembly Language and String Matching
    Shan, Shuqian
    Guo, Fengjuan
    Ren, Jiaxun
    [J]. ADVANCES IN ELECTRONIC COMMERCE, WEB APPLICATION AND COMMUNICATION, VOL 1, 2012, 148 : 363 - +
  • [44] Parallel Corpus Filtering based on Fuzzy String Matching
    Sen, Sukanta
    Ekbal, Asif
    Bhattacharyya, Pushpak
    [J]. FOURTH CONFERENCE ON MACHINE TRANSLATION (WMT 2019), VOL 3: SHARED TASK PAPERS, DAY 2, 2019 : 289 - 293
  • [45] Analysis and safety engineering of fuzzy string matching algorithms
    Pikies, Malgorzata
    Ali, Junade
    [J]. ISA TRANSACTIONS, 2021, 113 : 1 - 8
  • [46] Fuzzy String Matching Using Sentence Embedding Algorithms
    Rong, Yu
    Hu, Xiaolin
    [J]. NEURAL INFORMATION PROCESSING, ICONIP 2016, PT III, 2016, 9949 : 626 - 633
  • [47] Trie-join: a trie-based method for efficient string similarity joins
    Jianhua Feng
    Jiannan Wang
    Guoliang Li
    [J]. The VLDB Journal, 2012, 21 : 437 - 461
  • [48] Parallel String Similarity Join Approach Based on CPU-GPU Heterogeneous Architecture
    Xu K.
    Nie T.
    Shen D.
    Kou Y.
    Yu G.
    [J]. Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2021, 58 (03) : 598 - 608
  • [49] TokenJoin: Efficient Filtering for Set Similarity Join with Maximum Weighted Bipartite Matching
    Zeakis, Alexandros
    Skoutas, Dimitrios
    Sacharidis, Dimitris
    Papapetrou, Odysseas
    Koubarakis, Manolis
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2022, 16 (04) : 790 - 802
  • [50] Handling data-skewness in character based string similarity join using Hadoop
    Meena, Kanak
    Tayal, Devendra K.
    Castillo, Oscar
    Jain, Amita
    [J]. APPLIED COMPUTING AND INFORMATICS, 2022, 18 (1/2) : 22 - 44