String similarity join with different similarity thresholds based on novel indexing techniques

Cited by: 3
Authors
Rong, Chuitian [1]
Silva, Yasin N. [2]
Li, Chunqing [1]
Affiliations
[1] Tianjin Polytech Univ, Sch Comp Sci & Software Engn, Tianjin 300387, Peoples R China
[2] Arizona State Univ, Sch Math & Nat Sci, Tempe, AZ 85281 USA
Funding
National Natural Science Foundation of China
Keywords
similarity join; similarity aware index; similarity thresholds; EFFICIENT
DOI
10.1007/s11704-016-5231-1
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
String similarity join is an essential operation in many applications that need to find all similar string pairs from two given collections. A quantitative way to determine whether two strings are similar is to compute their similarity using a given similarity function. String pairs whose similarity is above a certain threshold are regarded as results. The current approach to solving the similarity join problem is to use a single threshold value. There are, however, several scenarios that require support for multiple thresholds, for instance, when the dataset includes strings of various lengths. In this scenario, longer string pairs typically tolerate many more typos than shorter ones. Therefore, we proposed a solution for string similarity joins that supports different similarity thresholds in a single operator. To support different thresholds, we devised two novel indexing techniques: partition-based indexing and similarity-aware indexing. To utilize the new indices and improve join performance, we proposed new filtering methods and index probing techniques. To the best of our knowledge, this is the first work that addresses this problem. Experimental results on real-world datasets show that our solution performs efficiently while providing a more flexible threshold specification.
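The problem statement in the abstract can be illustrated with a minimal brute-force sketch. This is not the indexing or filtering method of the paper, which the abstract does not detail. The similarity function (Jaccard over 2-grams), the helper names, and the length-based threshold rule with its values 0.8 and 0.6 are all assumptions chosen only to show a join in which the threshold depends on the pair being compared.

```python
from itertools import product


def qgrams(s, q=2):
    """Return the set of q-grams of s (the whole string if it is shorter than q)."""
    s = s.lower()
    if len(s) < q:
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a, b, q=2):
    """Jaccard similarity of the q-gram sets of a and b (assumed similarity function)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)


def threshold_for(a, b):
    """Hypothetical rule: short pairs need a higher similarity than long pairs."""
    return 0.8 if max(len(a), len(b)) <= 8 else 0.6


def similarity_join(r, s):
    """Naive nested-loop join: compare every pair against its own threshold."""
    results = []
    for a, b in product(r, s):
        sim = jaccard(a, b)
        if sim >= threshold_for(a, b):
            results.append((a, b, sim))
    return results


if __name__ == "__main__":
    left = ["join", "similarity join"]
    right = ["joint", "similarty join"]
    for a, b, sim in similarity_join(left, right):
        print(a, "|", b, "|", round(sim, 2))
```

Under these assumed settings, the short pair "join" / "joint" has a Jaccard similarity of 0.75 and is rejected by the stricter threshold for short strings, while the long pair "similarity join" / "similarty join" is accepted under the looser one. The nested loop compares every pair, which is the cost that index-based approaches aim to avoid.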
Pages: 307-319
Number of pages: 13