On-the-Fly Token Similarity Joins in Relational Databases

Cited by: 3
Authors
Augsten, Nikolaus [1 ]
Miraglia, Armando [2 ]
Neumann, Thomas [3 ]
Kemper, Alfons [3 ]
Affiliations
[1] Univ Salzburg, Salzburg, Austria
[2] Vrije Univ Amsterdam, Amsterdam, Netherlands
[3] Tech Univ Munich, Munich, Germany
Keywords
GRAMS; EDIT;
DOI
10.1145/2588555.2610530
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
Token similarity joins represent data items as sets of tokens; for example, strings are represented as sets of q-grams (substrings of length q). Two items are considered similar and match if their token sets have a large overlap. Previous work on similarity joins in databases mainly focuses on expressing the overlap computation with relational operators. The tokens are assumed to preexist in the database, and the token generation cannot be expressed as part of the query. Our goal is to efficiently compute token similarity joins on-the-fly, i.e., without any precomputed tokens or indexes. We define tokenize, a new relational operator that generates tokens and allows the similarity join to be fully integrated into relational databases. This allows us to (1) optimize the token generation as part of the query plan, (2) provide the query optimizer with cardinality estimates for tokens, and (3) choose efficient algorithms based on the query context. We discuss algebraic properties, cardinality estimates, and an efficient iterator algorithm for tokenize. We implemented our operator in the kernel of PostgreSQL and empirically evaluated its performance for similarity joins.
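The core idea in the abstract (strings as sets of q-grams, matching on token overlap) can be illustrated with a minimal Python sketch. This is not the paper's tokenize operator or its iterator algorithm: the function names `qgrams`, `overlap`, and `similarity_join`, the absolute overlap threshold, and the naive nested-loop strategy are all assumptions chosen for illustration only.

```python
def qgrams(s: str, q: int = 2) -> set:
    """Return the set of substrings of length q of s (empty if len(s) < q)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def overlap(a: set, b: set) -> int:
    """Number of tokens the two token sets have in common."""
    return len(a & b)


def similarity_join(left, right, q=2, threshold=3):
    """Naive nested-loop join (illustrative assumption, not the paper's method).

    Tokens are generated on the fly, with no precomputed tokens or indexes,
    and a pair matches when its q-gram overlap reaches the threshold.
    """
    right_tokens = [(r, qgrams(r, q)) for r in right]
    result = []
    for l in left:
        left_tokens = qgrams(l, q)
        for r, r_tokens in right_tokens:
            if overlap(left_tokens, r_tokens) >= threshold:
                result.append((l, r))
    return result


pairs = similarity_join(["kemper", "neumann"], ["kempe", "newman"])
print(pairs)
```

With q = 2 and a threshold of 3, "kemper" pairs with "kempe" (shared 2-grams ke, em, mp, pe) and "neumann" pairs with "newman" (shared ne, ma, an). A practical system would normally use a normalized measure or a length-dependent threshold rather than a fixed count.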
Pages: 1495-1506
Page count: 12
Related papers
50 in total
  • [1] ON-THE-FLY READING OF ENTIRE DATABASES
    AMMANN, P
    JAJODIA, S
    MAVULURI, P
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 1995, 7 (05) : 834 - 838
  • [2] THE STUDY OF JOINS IN FUZZY RELATIONAL DATABASES
    RAJU, KVSVN
    MAJUMDAR, AK
    [J]. FUZZY SETS AND SYSTEMS, 1987, 21 (01) : 19 - 34
  • [3] Querying relational databases without explicit joins
    Lawrence, R
    Barker, K
    [J]. CONCEPTUAL MODELING FOR NEW INFORMATION SYSTEMS TECHNOLOGIES, 2002, 2465 : 278 - 291
  • [4] Similarity Measures for Relational Databases
    Hajdinjak, Melita
    Bauer, Andrej
    [J]. INFORMATICA-JOURNAL OF COMPUTING AND INFORMATICS, 2009, 33 (02) : 135 - 141
  • [5] Similarity Measures for Pattern Matching On-the-fly
    Caluori, Ursina
    Simon, Klaus
    [J]. SIXTH INTERNATIONAL CONFERENCE ON MACHINE VISION (ICMV 2013), 2013, 9067
  • [6] On-The-Fly Data Integration Models for Biological Databases
    Naidu, Pavithra G.
    Palakal, Mathew J.
    Hartanto, Shielly
    [J]. APPLIED COMPUTING 2007, VOL 1 AND 2, 2007, : 118 - +
  • [7] On-the-Fly, Incremental, Consistent Reading of Entire Databases
    Pu, Calton
    [J]. ALGORITHMICA, 1986, 1 (1-4) : 271 - 287
  • [8] ON SIMILARITY RELATIONS IN FUZZY RELATIONAL DATABASES
    POTOCZNY, HB
    [J]. FUZZY SETS AND SYSTEMS, 1984, 12 (03) : 231 - 235
  • [9] Similarity-Based Classification in Relational Databases
    Honko, Piotr
    [J]. FUNDAMENTA INFORMATICAE, 2010, 101 (03) : 187 - 213
  • [10] JOINS AND SOLUTIONS OF THE PROJECTION SYNTHESIS PROBLEM IN RELATIONAL DATABASES .1.
    TENENBAUM, LA
    [J]. AUTOMATION AND REMOTE CONTROL, 1988, 49 (08) : 1094 - 1102