Extracting Syntactic Patterns from Databases

Cited by: 6
Authors
Ilyas, Andrew [1]
da Trindade, Joana M. F. [1]
Fernandez, Raul Castro [1]
Madden, Samuel [1]
Affiliations
[1] MIT, CSAIL, Cambridge, MA 02139 USA
DOI
10.1109/ICDE.2018.00014
Chinese Library Classification (CLC)
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Many database columns contain string or numerical data that conforms to a pattern, such as phone numbers, dates, addresses, product identifiers, and employee ids. These patterns are useful in a number of data processing applications, including understanding what a specific field represents when field names are ambiguous, identifying outlier values, and finding similar fields across data sets. One way to express such patterns would be to learn regular expressions for each field in the database. Unfortunately, existing techniques for regular expression learning are slow, taking hundreds of seconds for columns of just a few thousand values. In contrast, we develop XSYSTEM, an efficient method to learn patterns over database columns in significantly less time. We show that these patterns can not only be built quickly, but are expressive enough to capture a number of key applications, including detecting outliers, measuring column similarity, and assigning semantic labels to columns (based on a library of regular expressions). We evaluate these applications with datasets that include chemical databases (based on a collaboration with a pharmaceutical company), our university data warehouse, and open data from MassData.gov.
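Illustration (not part of the indexed record): the abstract's idea of expressing a column as a pattern and flagging non-conforming values can be sketched as below. This is not XSYSTEM's algorithm, which the record does not describe; it is a minimal stand-in that generalizes each character to a coarse class and takes the most frequent shape as the column pattern. The function names (char_class, value_shape, column_pattern, outliers) and the sample phone values are hypothetical.

```python
import re
from collections import Counter
from itertools import groupby


def char_class(ch: str) -> str:
    """Map one character to a coarse regex token (illustrative classes only)."""
    if ch.isascii() and ch.isdigit():
        return r"\d"
    if ch.isascii() and ch.isalpha():
        return "[A-Za-z]"
    return re.escape(ch)  # punctuation and everything else stay literal


def value_shape(value: str) -> str:
    """Generalize a value into a regex string, collapsing runs of equal tokens."""
    parts = []
    for token, run in groupby(char_class(ch) for ch in value):
        n = len(list(run))
        parts.append(token if n == 1 else f"{token}{{{n}}}")
    return "".join(parts)


def column_pattern(values):
    """Assumption: the most frequent shape stands in for the column pattern."""
    shape, _ = Counter(value_shape(v) for v in values).most_common(1)[0]
    return re.compile(shape)


def outliers(values, pattern):
    """Return the values that do not fully match the column pattern."""
    return [v for v in values if not pattern.fullmatch(v)]


# Made-up sample column: three phone-like values and one placeholder.
phones = ["617-253-0000", "617-253-1234", "212-555-0199", "N/A"]
pattern = column_pattern(phones)
flagged = outliers(phones, pattern)
```

Here flagged holds the values whose shape differs from the dominant one. A majority-shape rule like this breaks down on columns with several legitimate formats, which is one reason a learned pattern representation is more useful than a single regular expression per column.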
Pages: 41 - 52
Page count: 12
Related papers
50 in total
  • [21] Syntactic Rules of Extracting Test Cases from Software Requirements
    Masuda, Satoshi
    Matsuodani, Tohru
    Tsuda, Kazuhiko
    PROCEEDINGS OF THE 2016 8TH INTERNATIONAL CONFERENCE ON INFORMATION MANAGEMENT AND ENGINEERING (ICIME 2016), 2016, : 12 - 17
  • [22] Extracting Fuzzy Summaries from NoSQL Graph Databases
    Castelltort, Arnaud
    Laurent, Anne
    FLEXIBLE QUERY ANSWERING SYSTEMS 2015, 2016, 400 : 189 - 200
  • [23] Expert system for extracting syntactic information from Java code
    Department of Computer Science, University of West Indies, Cave Hill Campus, P.O. Box 64, Bridgetown, Barbados
    1600, 187-198 (August 2003)
  • [24] Extracting Troubles from Daily Reports based on Syntactic Pieces
    Yoshifumi, Kakimoto
    Yamamoto, Kazuhide
    PACLIC 22: PROCEEDINGS OF THE 22ND PACIFIC ASIA CONFERENCE ON LANGUAGE, INFORMATION AND COMPUTATION, 2008, : 411 - 417
  • [25] Extracting and Analyzing Hidden Graphs from Relational Databases
    Xirogiannopoulos, Konstantinos
    Deshpande, Amol
    SIGMOD'17: PROCEEDINGS OF THE 2017 ACM INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2017, : 897 - 912
  • [26] A social network analysis based approach to extracting knowledge patterns about innovation geography from patent databases
    Ferrara, Massimiliano
    Fosso, Diego
    Lanata, Davide
    Mavilia, Roberto
    Ursino, Domenico
    INTERNATIONAL JOURNAL OF DATA MINING MODELLING AND MANAGEMENT, 2018, 10 (01) : 23 - 72
  • [27] Expert system for extracting syntactic information from Java code
    Depradine, C
    EXPERT SYSTEMS WITH APPLICATIONS, 2003, 25 (02) : 187 - 198
  • [28] EXTRACTING KNOWLEDGE FROM LARGE MEDICAL DATABASES - AN AUTOMATED APPROACH
    BOHREN, BF
    HADZIKADIC, M
    HANLEY, EN
    COMPUTERS AND BIOMEDICAL RESEARCH, 1995, 28 (03) : 191 - 210
  • [29] Extracting knowledge from fuzzy relational databases with description logic
    Ma, Z. M.
    Zhang, Fu
    Yan, Li
    Cheng, Jingwei
    INTEGRATED COMPUTER-AIDED ENGINEERING, 2011, 18 (02) : 181 - 200
  • [30] EXTRACTING NEWS FROM SERVER SIDE DATABASES BY QUERY INTERFACES
    Han, Hao
    JOURNAL OF COMPUTER INFORMATION SYSTEMS, 2014, 54 (02) : 57 - 65