Fast-Join: An Efficient Method for Fuzzy Token Matching based String Similarity Join

Cited by: 0
Authors
Wang, Jiannan [1]
Li, Guoliang [1]
Feng, Jianhua [1]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci & Technol, Beijing 100084, Peoples R China
Source
IEEE 27TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2011) | 2011
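Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes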
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
String similarity join, which finds similar string pairs between two string sets, is an essential operation in many applications and has recently attracted significant attention in the database community. A significant challenge in similarity join is to implement an effective fuzzy match operation that finds all similar string pairs, including pairs that do not match exactly. In this paper, we propose a new similarity metric, called "fuzzy token matching based similarity", which extends token-based similarity functions (e.g., Jaccard similarity and Cosine similarity) by allowing fuzzy matches between two tokens. We study the problem of similarity join using this new similarity metric and present a signature-based method to address it. We propose new signature schemes and develop effective pruning techniques to improve performance. Experimental results show that our approach achieves high efficiency and result quality, and significantly outperforms state-of-the-art methods.
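As a rough illustration of the idea in the abstract, the sketch below extends Jaccard similarity so that two tokens may match approximately rather than exactly. It is not the paper's definition, which the abstract does not spell out. The function names, the threshold `delta`, the use of `difflib.SequenceMatcher` as the token-level similarity, and the greedy one-to-one token matching are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def token_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (stand-in measure, an assumption)."""
    return SequenceMatcher(None, a, b).ratio()


def fuzzy_overlap(tokens_a, tokens_b, delta=0.8):
    """Sum of similarities over a greedy one-to-one matching of tokens.

    Only token pairs with similarity >= delta (hypothetical threshold) may match.
    Greedy selection is a simplification; it is not guaranteed to be optimal.
    """
    candidates = []
    for i, a in enumerate(tokens_a):
        for j, b in enumerate(tokens_b):
            s = token_similarity(a, b)
            if s >= delta:
                candidates.append((s, i, j))
    # Take the most similar pairs first; each token is used at most once.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    used_a, used_b = set(), set()
    total = 0.0
    for s, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            total += s
    return total


def fuzzy_jaccard(s1: str, s2: str, delta=0.8) -> float:
    """Jaccard-style ratio with the exact overlap replaced by the fuzzy overlap."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    if not t1 and not t2:
        return 1.0
    overlap = fuzzy_overlap(t1, t2, delta)
    return overlap / (len(t1) + len(t2) - overlap)


if __name__ == "__main__":
    # Exact Jaccard on these token sets is 1/3; the fuzzy variant scores higher
    # because "mcgrady" and "macgrady" are allowed to match approximately.
    print(fuzzy_jaccard("nba mcgrady", "macgrady nba"))
```

With exact token matching, only `nba` would count as shared between the two example strings. Allowing approximate token matches lets the misspelled name contribute as well, which is the behavior the abstract describes.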
Pages: 458 - 469
Number of pages: 12
Related papers
50 items in total
  • [21] Efficient string similarity join in multi-core and distributed systems
    Yan, Cairong
    Zhao, Xue
    Zhang, Qinglong
    Huang, Yongfeng
    PLOS ONE, 2017, 12 (03):
  • [22] Highly Efficient String Similarity Search and Join over Compressed Indexes
    Xiao, Guorui
    Wang, Jin
    Lin, Chunbin
    Zaniolo, Carlo
    2022 IEEE 38TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2022), 2022, : 232 - 244
  • [24] Leveraging deletion neighborhoods and trie for efficient string similarity search and join
    Cui, Jia
    Meng, Dan
    Chen, Zhong-Tao
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2014, 8870
  • [25] Leveraging Deletion Neighborhoods and Trie for Efficient String Similarity Search and Join
    Cui, Jia
    Meng, Dan
    Chen, Zhong-Tao
    INFORMATION RETRIEVAL TECHNOLOGY, AIRS 2014, 2014, 8870 : 1 - 13
  • [26] LS-Join: Local Similarity Join on String Collections (Extended Abstract)
    Wang, Jiaying
    Yang, Xiaochun
    Wang, Bin
    Liu, Chengfei
    2018 IEEE 34TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2018, : 1779 - 1780
  • [27] String similarity join with different similarity thresholds based on novel indexing techniques
    Rong, Chuitian
    Silva, Yasin N.
    Li, Chunqing
    FRONTIERS OF COMPUTER SCIENCE, 2017, 11 (02) : 307 - 319
  • [28] String similarity join with different similarity thresholds based on novel indexing techniques
    Chuitian Rong
    Yasin N. Silva
    Chunqing Li
    Frontiers of Computer Science, 2017, 11 : 307 - 319
  • [29] Efficient similarity join for certain graphs
    Ruan, Qunsheng
    Wu, Qingfeng
    Liu, Xiling
    Miao, Fengyu
    Wang, Yingdong
    MICROSYSTEM TECHNOLOGIES-MICRO-AND NANOSYSTEMS-INFORMATION STORAGE AND PROCESSING SYSTEMS, 2021, 27 (04): : 1665 - 1685
  • [30] Efficient similarity join for certain graphs
    Qunsheng Ruan
    Qingfeng Wu
    Xiling Liu
    Fengyu Miao
    Yingdong Wang
    Microsystem Technologies, 2021, 27 : 1665 - 1685