Approximate String Joins with Abbreviations

Cited by: 14
Authors
Tao, Wenbo [1]
Deng, Dong [1]
Stonebraker, Michael [1]
Affiliations
[1] MIT, Cambridge, MA 02139 USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2017 / Vol. 11 / No. 01
Keywords
SIMILARITY JOINS; EFFICIENT ALGORITHM
DOI
10.14778/3151113.3151118
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
String joins have wide applications in data integration and cleaning. The inconsistency of data caused by data errors, term variations and missing values has led to the need for approximate string joins (ASJ). In this paper, we study ASJ with abbreviations, which are a frequent type of term variation. Although prior works have studied ASJ given a user-inputted dictionary of synonym rules, they have three common limitations. First, they suffer from low precision in the presence of abbreviations having multiple full forms. Second, their join algorithms are not scalable due to the exponential time complexity. Third, the dictionary may not exist since abbreviations are highly domain-dependent. We propose an end-to-end workflow to address these limitations. There are three main components in the workflow: (1) a new similarity measure taking abbreviations into account that can handle abbreviations having multiple full forms, (2) an efficient join algorithm following the filter-verification framework and (3) an unsupervised approach to learn a dictionary of abbreviation rules from input strings. We evaluate our workflow on four real-world datasets and show that our workflow outputs accurate join results, scales well as input size grows and greatly outperforms state-of-the-art approaches in both accuracy and efficiency.
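As a rough illustration of the filter-verification idea mentioned in the abstract, the following Python sketch joins two small string lists using a dictionary of abbreviation rules. It is not the authors' similarity measure or join algorithm: the names `expansions`, `similarity`, and `join`, the rule dictionary, and the choice of token-level Jaccard similarity are all assumptions made for this example. The sketch enumerates every expansion of a string, which is exponential in the number of abbreviations it contains, i.e. exactly the cost the abstract criticizes in prior work.

```python
from collections import defaultdict
from itertools import product


def expansions(tokens, rules):
    """Yield each token set obtained by keeping or expanding every abbreviation.

    `rules` is a hypothetical dictionary mapping an abbreviation to a list of
    candidate full forms, e.g. {"cs": ["computer science"]}.
    """
    choices = [[(t,)] + [tuple(full.split()) for full in rules.get(t, [])]
               for t in tokens]
    for combo in product(*choices):
        yield frozenset(tok for part in combo for tok in part)


def jaccard(a, b):
    """Token-set Jaccard similarity; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def similarity(s, t, rules):
    """Best Jaccard score over all pairs of expansions (naive, exponential)."""
    return max(jaccard(x, y)
               for x in expansions(s.lower().split(), rules)
               for y in expansions(t.lower().split(), rules))


def join(left, right, rules, threshold):
    """Return index pairs (i, j) with similarity(left[i], right[j]) >= threshold.

    Filter step: an inverted index keeps only pairs that share at least one
    token under some expansion (safe for any threshold > 0).
    Verification step: the exact similarity is computed for each candidate.
    """
    index = defaultdict(set)
    for j, t in enumerate(right):
        for exp in expansions(t.lower().split(), rules):
            for tok in exp:
                index[tok].add(j)

    results = []
    for i, s in enumerate(left):
        candidates = set()
        for exp in expansions(s.lower().split(), rules):
            for tok in exp:
                candidates |= index.get(tok, set())
        for j in sorted(candidates):
            if similarity(s, right[j], rules) >= threshold:
                results.append((i, j))
    return results


# Illustrative data only; not taken from the paper's datasets.
rules = {"cs": ["computer science"], "dept": ["department"]}
left = ["dept of cs"]
right = ["department of computer science", "department of chemistry"]
matches = join(left, right, rules, threshold=0.8)
```

With these made-up inputs, only the first string in `right` should pass verification, since the second shares just two of its tokens with any expansion of `dept of cs`.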
Pages: 53-65
Page count: 13