Extending String Similarity Join to Tolerant Fuzzy Token Matching

Cited by: 36
Authors
Wang, Jiannan [1]
Li, Guoliang [1]
Feng, Jianhua [1]
Affiliation
[1] Tsinghua Univ, Dept Comp Sci & Technol, Tsinghua Natl Lab Informat Sci & Technol TNList, Beijing 100084, Peoples R China
Source
ACM TRANSACTIONS ON DATABASE SYSTEMS | 2014 / Vol. 39 / No. 01
Funding
National Natural Science Foundation of China;
Keywords
Algorithms; Performance; Experiment; String similarity join; similarity function; signature scheme; fuzzy token matching-based similarity; weighted tokens; ALGORITHMS;
DOI
10.1145/2535628
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
String similarity join, which finds similar string pairs between two string sets, is an essential operation in many applications and has attracted significant attention recently in the database community. A significant challenge in similarity join is to implement an effective fuzzy match operation to find all similar string pairs that may not match exactly. In this article, we propose a new similarity function, called fuzzy-token-matching-based similarity, which extends token-based similarity functions (e.g., Jaccard similarity and cosine similarity) by allowing fuzzy matches between two tokens. We study the problem of similarity join using this new similarity function and present a signature-based method to address this problem. We propose new signature schemes and develop effective pruning techniques to improve the performance. We also extend our techniques to support weighted tokens. Experimental results show that our method achieves high efficiency and result quality and significantly outperforms state-of-the-art approaches.
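The idea of letting tokens match approximately inside a token-based measure can be illustrated with a small sketch. The abstract does not give the definition, so everything here is an assumption made for illustration: tokens are compared by normalized edit similarity, token pairs below a hypothetical threshold `delta` are ignored, pairs are matched one-to-one with a simple greedy pass (not necessarily the matching the article uses), and the resulting overlap is plugged into the usual Jaccard formula. The function names are hypothetical as well.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def token_similarity(a: str, b: str) -> float:
    """Normalized edit similarity in [0, 1] (assumed token measure)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def fuzzy_overlap(tokens_a, tokens_b, delta: float) -> float:
    """Greedy one-to-one matching of token pairs with similarity >= delta.

    Greedy matching is a simplification chosen for brevity; it can
    return less than the best possible total on some inputs.
    """
    candidates = sorted(
        ((token_similarity(a, b), i, j)
         for (i, a), (j, b) in product(enumerate(tokens_a),
                                       enumerate(tokens_b))),
        reverse=True,
    )
    used_a, used_b = set(), set()
    total = 0.0
    for sim, i, j in candidates:
        if sim < delta:
            break  # remaining pairs are all below the threshold
        if i in used_a or j in used_b:
            continue  # each token may be matched at most once
        used_a.add(i)
        used_b.add(j)
        total += sim
    return total


def fuzzy_jaccard(s: str, t: str, delta: float = 0.8) -> float:
    """Jaccard-style score with the exact overlap replaced by fuzzy_overlap."""
    a, b = s.lower().split(), t.lower().split()
    if not a and not b:
        return 1.0
    overlap = fuzzy_overlap(a, b, delta)
    return overlap / (len(a) + len(b) - overlap)


if __name__ == "__main__":
    # A one-character typo in a token still contributes to the overlap.
    print(fuzzy_jaccard("nba mcgrady", "nba macgrady"))
    # With delta = 1.0 only identical tokens match (plain Jaccard on
    # strings without repeated tokens).
    print(fuzzy_jaccard("nba mcgrady", "nba macgrady", delta=1.0))
```

With `delta = 1.0` the sketch reduces to exact token matching, which is one way to see the fuzzy variant as an extension of the token-based measure rather than a replacement for it.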
Pages: 45
Related papers
50 in total
  • [21] String similarity join with different similarity thresholds based on novel indexing techniques
    Rong, Chuitian
    Silva, Yasin N.
    Li, Chunqing
    [J]. FRONTIERS OF COMPUTER SCIENCE, 2017, 11 (02) : 307 - 319
  • [22] String similarity join with different similarity thresholds based on novel indexing techniques
    Chuitian Rong
    Yasin N. Silva
    Chunqing Li
    [J]. Frontiers of Computer Science, 2017, 11 : 307 - 319
  • [23] Fuzzy String Matching with Finite Automata
    Kostanyan, Armen
    [J]. 2017 ELEVENTH INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND INFORMATION TECHNOLOGIES (CSIT), 2017, : 9 - 11
  • [24] Approximate String Matching by Fuzzy Automata
    Snasel, Vaclav
    Keprt, Ales
    Abraham, Ajith
    Hassanien, Aboul Ella
    [J]. MAN-MACHINE INTERACTIONS, 2009, 59 : 281 - +
  • [25] Generalized fuzzy indices for similarity matching
    Tolias, YA
    Panas, SM
    Tsoukalas, LH
    [J]. FUZZY SETS AND SYSTEMS, 2001, 120 (02) : 255 - 270
  • [26] Efficient string similarity join in multi-core and distributed systems
    Yan, Cairong
    Zhao, Xue
    Zhang, Qinglong
    Huang, Yongfeng
    [J]. PLOS ONE, 2017, 12 (03):
  • [28] Highly Efficient String Similarity Search and Join over Compressed Indexes
    Xiao, Guorui
    Wang, Jin
    Lin, Chunbin
    Zaniolo, Carlo
    [J]. 2022 IEEE 38TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2022), 2022, : 232 - 244
  • [29] Leveraging Deletion Neighborhoods and Trie for Efficient String Similarity Search and Join
    Cui, Jia
    Meng, Dan
    Chen, Zhong-Tao
    [J]. INFORMATION RETRIEVAL TECHNOLOGY, AIRS 2014, 2014, 8870 : 1 - 13
  • [30] Leveraging deletion neighborhoods and trie for efficient string similarity search and join
    Cui, Jia
    Meng, Dan
    Chen, Zhong-Tao
    [J]. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2014, 8870