Similarity Joins for Uncertain Strings

Cited by: 0
Authors
Patil, Manish [1]
Shah, Rahul [1]
Affiliations
[1] Louisiana State Univ, Baton Rouge, LA 70803 USA
Funding
National Science Foundation (USA)
Keywords
Uncertain strings; string joins; edit distance
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
A string similarity join finds all similar string pairs between two input string collections. It is an essential operation in many applications, such as data integration and cleaning, and has been extensively studied for deterministic strings. Increasingly, applications have to deal with imprecise strings or strings that carry fuzzy information. This work presents the first solution for answering similarity join queries over uncertain strings that implements possible-world semantics, using edit distance as the measure of similarity. Given two collections of uncertain strings and input parameters (k, tau), the task is to find all string pairs (R, S), one string from each collection, such that Pr(ed(R, S) <= k) > tau; that is, the probability that the edit distance between R and S is at most k exceeds the probability threshold tau. The join problem can be addressed by obtaining, for each string R in the first collection, all strings in the second collection that are similar to it. However, existing solutions for answering such similarity search queries on uncertain string databases only support a deterministic string as input. Exploiting these solutions would require exponentially many possible worlds of R to be considered, which is not only ineffective but also prohibitively expensive. We propose various filtering techniques that give upper and/or lower bounds on Pr(ed(R, S) <= k) without instantiating possible worlds for either of the strings. We then incorporate these techniques into an indexing scheme and significantly reduce the filtering overhead. Further, we reduce the verification cost of a string pair that survives pruning by using a trie structure, which allows the verification work to be shared across the exponentially many possible instances of the candidate string pair. Finally, we evaluate the effectiveness of the proposed approach through thorough practical experimentation.
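To make the join condition Pr(ed(R, S) <= k) > tau concrete, the following is a minimal sketch of the naive possible-worlds baseline that the abstract describes as prohibitively expensive. It is not the authors' method. It assumes a character-level uncertainty model in which each position of a string is an independent distribution over characters; the abstract does not specify the model, and the names `possible_worlds`, `prob_within_k`, and `naive_join` are hypothetical.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def possible_worlds(uncertain):
    """Yield (string, probability) for every instantiation.

    ASSUMPTION: an uncertain string is a list of positions, each a dict
    mapping a character to its probability, with positions independent.
    """
    for combo in product(*(pos.items() for pos in uncertain)):
        text = "".join(ch for ch, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield text, prob


def prob_within_k(r, s, k):
    """Pr(ed(R, S) <= k), by enumerating all pairs of possible worlds."""
    total = 0.0
    for r_text, r_prob in possible_worlds(r):
        for s_text, s_prob in possible_worlds(s):
            if edit_distance(r_text, s_text) <= k:
                total += r_prob * s_prob
    return total


def naive_join(coll_r, coll_s, k, tau):
    """Return index pairs (i, j) whose match probability exceeds tau."""
    return [(i, j)
            for i, r in enumerate(coll_r)
            for j, s in enumerate(coll_s)
            if prob_within_k(r, s, k) > tau]
```

The number of worlds enumerated grows as the product of the alternatives at every uncertain position of both strings, which is why the abstract's filtering bounds and shared trie-based verification aim to avoid this enumeration.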
Pages: 1471-1482
Number of pages: 12