Approximate String Joins with Abbreviations

Cited by: 14
Authors
Tao, Wenbo [1]
Deng, Dong [1]
Stonebraker, Michael [1]
Affiliations
[1] MIT, Cambridge, MA 02139 USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2017 / Vol. 11 / No. 01
Keywords
SIMILARITY JOINS; EFFICIENT ALGORITHM
DOI
10.14778/3151113.3151118
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
String joins have wide applications in data integration and cleaning. The inconsistency of data caused by data errors, term variations and missing values has led to the need for approximate string joins (ASJ). In this paper, we study ASJ with abbreviations, which are a frequent type of term variation. Although prior works have studied ASJ given a user-inputted dictionary of synonym rules, they have three common limitations. First, they suffer from low precision in the presence of abbreviations having multiple full forms. Second, their join algorithms are not scalable due to the exponential time complexity. Third, the dictionary may not exist since abbreviations are highly domain-dependent. We propose an end-to-end workflow to address these limitations. There are three main components in the workflow: (1) a new similarity measure taking abbreviations into account that can handle abbreviations having multiple full forms, (2) an efficient join algorithm following the filter-verification framework and (3) an unsupervised approach to learn a dictionary of abbreviation rules from input strings. We evaluate our workflow on four real-world datasets and show that our workflow outputs accurate join results, scales well as input size grows and greatly outperforms state-of-the-art approaches in both accuracy and efficiency.
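To make the filter-verification idea in the abstract concrete, the following is a minimal Python sketch. It is not the paper's similarity measure or join algorithm: it assumes token-level Jaccard similarity, a hand-written rule dictionary, and a simple inverted-index filter, and the names `expansions`, `similarity`, `signature`, and `join` are hypothetical. Verification enumerates every expansion of each string, which is exactly the exponential behavior the abstract criticizes, so it only suits short strings.

```python
from itertools import product


def expansions(tokens, rules):
    """Yield each token set obtained by keeping or expanding every token."""
    options = [
        [(t,)] + [tuple(full.split()) for full in sorted(rules.get(t, ()))]
        for t in tokens
    ]
    for choice in product(*options):
        yield frozenset(tok for part in choice for tok in part)


def jaccard(a, b):
    """Jaccard similarity of two token sets (two empty sets count as equal)."""
    return len(a & b) / len(a | b) if a | b else 1.0


def similarity(s, t, rules):
    """Best Jaccard score over all pairs of expansions (naive, exponential)."""
    return max(
        jaccard(x, y)
        for x in expansions(s.lower().split(), rules)
        for y in expansions(t.lower().split(), rules)
    )


def signature(s, rules):
    """All tokens that can appear in any expansion of s."""
    tokens = s.lower().split()
    sig = set(tokens)
    for t in tokens:
        for full in rules.get(t, ()):
            sig.update(full.split())
    return sig


def join(left, right, rules, threshold):
    """Filter: keep pairs sharing a signature token. Verify: score them.

    Assumes threshold > 0 and non-empty strings; under that assumption the
    filter cannot drop a qualifying pair, because every expansion's tokens
    are contained in the string's signature.
    """
    index = {}
    for j, t in enumerate(right):
        for tok in signature(t, rules):
            index.setdefault(tok, set()).add(j)
    results = []
    for s in left:
        candidates = set()
        for tok in signature(s, rules):
            candidates |= index.get(tok, set())
        for j in sorted(candidates):
            if similarity(s, right[j], rules) >= threshold:
                results.append((s, right[j]))
    return results


# Illustrative rules: "cs" has two full forms, as in the abstract's first limitation.
rules = {"cs": {"computer science", "cognitive science"}, "dept": {"department"}}
right = [
    "department of computer science",
    "department of cognitive science",
    "department of physics",
]
print(join(["dept of cs"], right, rules, 0.8))
```

With these rules, "dept of cs" matches both full forms of "cs" equally well, which illustrates why abbreviations with multiple full forms can hurt precision when the measure has no way to prefer one expansion.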
Pages: 53-65
Page count: 13
Related papers
  • [1] Approximate string joins
    Srivastava, D
    SSDBM 2002: 15TH INTERNATIONAL CONFERENCE ON SCIENTIFIC AND STATISTICAL DATABASE MANAGEMENT, 2003: 7 - 7
  • [2] Approximate Joins for XML Using g-String
    Li, Fei
    Wang, Hongzhi
    Zhang, Cheng
    Hao, Liang
    Li, Jianzhong
    Gao, Hong
    DATABASE AND XML TECHNOLOGIES, 2010, 6309 : 3 - 17
  • [3] String Joins with Synonyms
    Song, Gwangho
    Lee, Hongrae
    Shim, Kyuseok
    Park, Yoonjae
    Kim, Wooyeol
    DATABASE SYSTEMS FOR ADVANCED APPLICATIONS (DASFAA 2020), PT III, 2020, 12114 : 389 - 405
  • [4] ApproxJoin: Approximate Distributed Joins
    Do Le Quoc
    Akkus, Istemi Ekin
    Bhatotia, Pramod
    Blanas, Spyros
    Chen, Ruichuan
    Fetzer, Christof
    Strufe, Thorsten
    PROCEEDINGS OF THE 2018 ACM SYMPOSIUM ON CLOUD COMPUTING (SOCC '18), 2018: 426 - 438
  • [5] C-STRING COMPARER HANDLES ABBREVIATIONS
    RANKIN, D
    EDN, 1989, 34 (08) : 214 - &
  • [6] String Similarity Joins: An Experimental Evaluation
    Jiang, Yu
    Li, Guoliang
    Feng, Jianhua
    Li, Wen-Syan
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2014, 7 (08): 625 - 636
  • [7] Approximate Geospatial Joins with Precision Guarantees
    Kipf, Andreas
    Lang, Harald
    Pandey, Varun
    Persa, Raul Alexandru
    Boncz, Peter
    Neumann, Thomas
    Kemper, Alfons
    2018 IEEE 34TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2018: 1360 - 1363
  • [8] Approximate joins for XML at label level
    Li, Fei
    Wang, Hongzhi
    Hao, Liang
    Li, Jianzhong
    Gao, Hong
    INFORMATION SCIENCES, 2014, 282 : 237 - 249
  • [9] Approximate String Processing
    Hadjieleftheriou, Marios
    Srivastava, Divesh
    FOUNDATIONS AND TRENDS IN DATABASES, 2009, 2 (04): 267 - 402
  • [10] APPROXIMATE STRING MATCHING
    HALL, PAV
    DOWLING, GR
    COMPUTING SURVEYS, 1980, 12 (04) : 381 - 402