Similarity Joins for Uncertain Strings

Cited by: 0

Authors:

Patil, Manish [1]

Shah, Rahul [1]

Affiliations:

[1] Louisiana State Univ, Baton Rouge, LA 70803 USA

Source:

SIGMOD'14: PROCEEDINGS OF THE 2014 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA | 2014

Funding:

National Science Foundation (USA);

Keywords:

Uncertain strings; string joins; edit distance;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Subject classification code:

0812 ;

Abstract:

A string similarity join finds all similar string pairs between two input string collections. It is an essential operation in many applications, such as data integration and cleaning, and has been extensively studied for deterministic strings. Increasingly, applications have to deal with imprecise strings or strings that carry fuzzy information. This work presents the first solution for answering similarity join queries over uncertain strings that implements possible-world semantics, using edit distance as the measure of similarity. Given two collections of uncertain strings, R and S, and input (k, tau), our task is to find string pairs (R, S), one string from each collection, such that Pr(ed(R, S) <= k) > tau, i.e., the probability that the edit distance between R and S is at most k exceeds the probability threshold tau. We can address the join problem by obtaining, for each string R in the first collection, all strings in the second collection that are similar to it. However, existing solutions for answering such similarity search queries on uncertain string databases only support a deterministic string as input. Exploiting these solutions would require exponentially many possible worlds of R to be considered, which is not only ineffective but also prohibitively expensive. We propose various filtering techniques that give upper and/or lower bounds on Pr(ed(R, S) <= k) without instantiating possible worlds for either of the strings. We then incorporate these techniques into an indexing scheme and significantly reduce the filtering overhead. Further, we alleviate the verification cost of a string pair that survives pruning by using a trie structure, which allows us to overlap the verification cost of exponentially many possible instances of the candidate string pair. Finally, we evaluate the effectiveness of the proposed approach through thorough practical experimentation.
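To make the join condition Pr(ed(R, S) <= k) > tau concrete, the following is a minimal sketch of the naive possible-world baseline that the abstract describes as prohibitively expensive. It is not the authors' filtering, indexing, or trie-based method. It assumes a character-level uncertainty model in which each position of an uncertain string is an independent distribution over characters; the abstract does not state the model, and the names `edit_distance`, `possible_worlds`, `prob_within`, and `naive_join` are hypothetical, chosen for illustration only.

```python
from itertools import product


def edit_distance(a, b):
    """Classic dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def possible_worlds(u):
    """Yield (deterministic string, probability) for every possible world.

    ASSUMPTION: `u` is a list of dicts, one per position, mapping a
    character to its probability; positions are independent.
    """
    for combo in product(*(pos.items() for pos in u)):
        text = "".join(ch for ch, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield text, prob


def prob_within(r, s, k):
    """Pr(ed(r, s) <= k), computed by enumerating all pairs of worlds."""
    total = 0.0
    for wr, pr in possible_worlds(r):
        for ws, ps in possible_worlds(s):
            if edit_distance(wr, ws) <= k:
                total += pr * ps
    return total


def naive_join(R, S, k, tau):
    """Return index pairs (i, j) with Pr(ed(R[i], S[j]) <= k) > tau."""
    return [(i, j)
            for i, r in enumerate(R)
            for j, s in enumerate(S)
            if prob_within(r, s, k) > tau]
```

The number of worlds grows exponentially with the number of uncertain positions, which is why this enumeration is only usable on very short strings and why bounds that avoid instantiating possible worlds matter.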

Citation

Pages: 1471-1482

Page count: 12
