Approximate String Joins with Abbreviations

Cited by: 14
Authors
Tao, Wenbo [1]
Deng, Dong [1]
Stonebraker, Michael [1]
Affiliations
[1] MIT, Cambridge, MA 02139 USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2017 / Vol. 11 / Issue 01
Keywords
SIMILARITY JOINS; EFFICIENT ALGORITHM;
DOI
10.14778/3151113.3151118
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Subject classification code
0812;
Abstract
String joins have wide applications in data integration and cleaning. The inconsistency of data caused by data errors, term variations and missing values has led to the need for approximate string joins (ASJ). In this paper, we study ASJ with abbreviations, which are a frequent type of term variation. Although prior works have studied ASJ given a user-inputted dictionary of synonym rules, they have three common limitations. First, they suffer from low precision in the presence of abbreviations having multiple full forms. Second, their join algorithms are not scalable due to the exponential time complexity. Third, the dictionary may not exist since abbreviations are highly domain-dependent. We propose an end-to-end workflow to address these limitations. There are three main components in the workflow: (1) a new similarity measure taking abbreviations into account that can handle abbreviations having multiple full forms, (2) an efficient join algorithm following the filter-verification framework and (3) an unsupervised approach to learn a dictionary of abbreviation rules from input strings. We evaluate our workflow on four real-world datasets and show that our workflow outputs accurate join results, scales well as input size grows and greatly outperforms state-of-the-art approaches in both accuracy and efficiency.
Pages: 53-65
Page count: 13
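
The filter-verification framework named in the abstract can be illustrated with a generic sketch. This is not the paper's join algorithm, and it does not use the paper's abbreviation-aware similarity measure: it substitutes plain token-set Jaccard similarity and standard prefix filtering as stand-ins. The function names `jaccard` and `similarity_join`, the whitespace tokenization, and the sample strings are all assumptions made for illustration; the sketch also assumes 0 < threshold <= 1.

```python
import math
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_join(strings, threshold):
    """Self-join: sorted index pairs (i, j), i < j, with Jaccard >= threshold.

    Illustrative stand-in only; assumes 0 < threshold <= 1.
    """
    token_sets = [set(s.lower().split()) for s in strings]

    # Global token order: rarest tokens first, so prefixes are selective.
    freq = defaultdict(int)
    for tokens in token_sets:
        for tok in tokens:
            freq[tok] += 1

    # Filter step: two sets can only reach the threshold if their
    # prefixes (under the same global order) share at least one token.
    index = defaultdict(list)   # token -> ids of strings indexed so far
    candidates = set()
    for i, tokens in enumerate(token_sets):
        ordered = sorted(tokens, key=lambda t: (freq[t], t))
        # Minimum overlap this set needs; the epsilon guards float round-up.
        need = math.ceil(threshold * len(ordered) - 1e-9)
        prefix_len = len(ordered) - need + 1
        for tok in ordered[:prefix_len]:
            for j in index[tok]:
                candidates.add((j, i))
            index[tok].append(i)

    # Verification step: compute the exact similarity for survivors only.
    return sorted(
        (i, j) for i, j in candidates
        if jaccard(token_sets[i], token_sets[j]) >= threshold
    )


strings = [
    "mit computer science",
    "mit computer science lab",
    "harvard law school",
    "computer science",
]
print(similarity_join(strings, 0.6))  # prints [(0, 1), (0, 3)]
```

The filter step never discards a true match for this measure; it only reduces how many pairs reach the exact, more expensive verification step. A measure that handles abbreviations would need its own filter with the same guarantee.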