Extracting Syntactic Patterns from Databases

Cited by: 5
Authors
Ilyas, Andrew [1]
da Trindade, Joana M. F. [1]
Fernandez, Raul Castro [1]
Madden, Samuel [1]
Affiliation
[1] MIT, CSAIL, Cambridge, MA 02139 USA
DOI
10.1109/ICDE.2018.00014
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Many database columns contain string or numerical data that conforms to a pattern, such as phone numbers, dates, addresses, product identifiers, and employee IDs. These patterns are useful in a number of data processing applications, including understanding what a specific field represents when field names are ambiguous, identifying outlier values, and finding similar fields across data sets. One way to express such patterns is to learn a regular expression for each field in the database. Unfortunately, existing techniques for regular expression learning are slow, taking hundreds of seconds for columns of just a few thousand values. In contrast, we develop XSYSTEM, an efficient method to learn patterns over database columns in significantly less time. We show that these patterns can not only be built quickly, but are also expressive enough to support a number of key applications, including detecting outliers, measuring column similarity, and assigning semantic labels to columns (based on a library of regular expressions). We evaluate these applications on datasets that range from chemical databases (based on a collaboration with a pharmaceutical company) to our university data warehouse and open data from MassData.gov.
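To make the idea of a column-level syntactic pattern concrete, the following is a minimal Python sketch. It is not the XSYSTEM algorithm, which this record does not describe in detail. It assumes a much simpler stand-in technique: each value is reduced to a character-class signature (digits become `D`, letters become `L`, punctuation is kept), and values whose signature is rare in the column are flagged as outliers. The function names, the sample data, and the `min_share` threshold are all hypothetical.

```python
from collections import Counter


def signature(value: str) -> str:
    """Map a string to a coarse character-class signature (illustrative only)."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("D")      # any digit
        elif ch.isalpha():
            out.append("L")      # any letter
        else:
            out.append(ch)       # keep punctuation and whitespace literally
    return "".join(out)


def column_profile(values):
    """Count how often each signature occurs in a column."""
    return Counter(signature(v) for v in values)


def outliers(values, min_share=0.25):
    """Return values whose signature covers less than min_share of the column."""
    profile = column_profile(values)
    total = sum(profile.values())
    return [v for v in values if profile[signature(v)] / total < min_share]


# Hypothetical column of phone numbers with one malformed entry.
phones = ["617-555-0101", "617-555-0199", "212-555-0147", "617-555-0123", "n/a"]
print(column_profile(phones).most_common(1))
print(outliers(phones))
```

In this sketch the dominant signature plays the role of the learned pattern, and the same profile could be compared across columns as a rough similarity measure. A real system would need a richer pattern language than fixed-length signatures.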
Pages: 41-52
Page count: 12