Similarity Joins for Uncertain Strings

Cited by: 0
Authors
Patil, Manish [1]
Shah, Rahul [1]
Affiliations
[1] Louisiana State Univ, Baton Rouge, LA 70803 USA
Funding
U.S. National Science Foundation
Keywords
Uncertain strings; string joins; edit distance
DOI
Not available
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
A string similarity join finds all similar string pairs between two input string collections. It is an essential operation in many applications, such as data integration and cleaning, and has been extensively studied for deterministic strings. Increasingly, applications have to deal with imprecise strings or strings that carry fuzzy information. This work presents the first solution for answering similarity join queries over uncertain strings that implements possible-world semantics, using edit distance as the measure of similarity. Given two collections of uncertain strings R and S and input parameters (k, tau), our task is to find string pairs (R, S), one string from each collection, such that Pr(ed(R, S) <= k) > tau, i.e., the probability that the edit distance between R and S is at most k exceeds the probability threshold tau. We can address the join problem by finding, for each string R in the first collection, all strings in the second collection that are similar to it. However, existing solutions for answering such similarity search queries on uncertain string databases only support a deterministic string as input. Exploiting these solutions would require considering exponentially many possible worlds of R, which is not only ineffective but also prohibitively expensive. We propose various filtering techniques that give upper and/or lower bounds on Pr(ed(R, S) <= k) without instantiating possible worlds for either of the strings. We then incorporate these techniques into an indexing scheme and significantly reduce the filtering overhead. Further, we reduce the verification cost of a string pair that survives pruning by using a trie structure, which allows us to share verification work across the exponentially many possible instances of the candidate string pair. Finally, we evaluate the effectiveness of the proposed approach through thorough practical experimentation.
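To make the join condition Pr(ed(R, S) <= k) > tau concrete, the sketch below evaluates it by brute force under possible-world semantics. It is not the authors' method: it is the naive baseline that enumerates every possible world, which is exactly the exponential cost the abstract says the filtering, indexing, and trie-based verification are meant to avoid. The uncertainty model is an assumption made for illustration (each position is an independent distribution over characters), and the names `possible_worlds`, `prob_within_k`, and `naive_join` are hypothetical.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance using a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def possible_worlds(ustring):
    """Yield (instance, probability) for every possible world.

    ASSUMPTION: an uncertain string is a list of positions, each a dict
    mapping a character to its probability, with positions independent.
    """
    for combo in product(*(pos.items() for pos in ustring)):
        text = "".join(ch for ch, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield text, prob


def prob_within_k(r, s, k):
    """Pr(ed(R, S) <= k), summed over all pairs of possible worlds."""
    total = 0.0
    for world_r, p_r in possible_worlds(r):
        for world_s, p_s in possible_worlds(s):
            if edit_distance(world_r, world_s) <= k:
                total += p_r * p_s
    return total


def naive_join(coll_r, coll_s, k, tau):
    """Return index pairs (i, j) whose match probability exceeds tau."""
    return [(i, j)
            for i, r in enumerate(coll_r)
            for j, s in enumerate(coll_s)
            if prob_within_k(r, s, k) > tau]


if __name__ == "__main__":
    r = [{"a": 1.0}, {"b": 0.6, "c": 0.4}]   # worlds: "ab" (0.6), "ac" (0.4)
    s = [{"a": 1.0}, {"b": 1.0}]             # single world: "ab"
    print(prob_within_k(r, s, 0))
    print(naive_join([r], [s], k=0, tau=0.5))
```

The number of world pairs grows as the product of the alphabet choices at every uncertain position in both strings, so this is usable only on toy inputs; it serves as a reference against which any bound-based filter could be checked.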
Pages: 1471 - 1482
Number of pages: 12
Related papers
  • [1] Scalable Similarity Joins of Tokenized Strings
    Metwally, Ahmed
    Huang, Chun-Heng
    2019 IEEE 35TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2019), 2019, : 1766 - 1777
  • [2] Compact similarity joins
    Bryan, Brent
    Eberhardt, Frederick
    Faloutsos, Christos
    2008 IEEE 24TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, VOLS 1-3, 2008, : 346 - +
  • [3] Diversity in Similarity Joins
    Santos, Lucio F. D.
    Carvalho, Luiz Olmes
    Oliveira, Willian D.
    Traina, Agma J. M.
    Traina, Caetano, Jr.
    SIMILARITY SEARCH AND APPLICATIONS, SISAP 2015, 2015, 9371 : 42 - 53
  • [4] Extending SPARQL with Similarity Joins
    Ferrada, Sebastian
    Bustos, Benjamin
    Hogan, Aidan
    SEMANTIC WEB - ISWC 2020, PT I, 2020, 12506 : 201 - 217
  • [5] Similarity joins and clustering for SPARQL
    Ferrada, Sebastian
    Bustos, Benjamin
    Hogan, Aidan
    SEMANTIC WEB, 2024, 15 (05) : 1701 - 1732
  • [6] Similarity Joins of Sparse Features
    Metwally, Ahmed
    Shum, Michael
    COMPANION OF THE 2024 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, SIGMOD-COMPANION 2024, 2024, : 80 - 92
  • [7] Metric space similarity joins
    Jacox, Edwin H.
    Samet, Hanan
    ACM TRANSACTIONS ON DATABASE SYSTEMS, 2008, 33 (02):
  • [8] Efficient Metric Indexing for Similarity Search and Similarity Joins
    Chen, Lu
    Gao, Yunjun
    Li, Xinhan
    Jensen, Christian S.
    Chen, Gang
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2017, 29 (03) : 556 - 571
  • [9] Leveraging Similarity Joins for Signal Reconstruction
    Asudeh, Abolfazl
    Nazi, Azade
    Augustine, Jees
    Thirumuruganathan, Saravanan
    Zhang, Nan
    Das, Gautam
    Srivastava, Divesh
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2018, 11 (10) : 1276 - 1288
  • [10] Quicker Similarity Joins in Metric Spaces
    Fredriksson, Kimmo
    Braithwaite, Billy
    SIMILARITY SEARCH AND APPLICATIONS (SISAP), 2013, 8199 : 127 - 140