Parametric schema inference for massive JSON datasets

Cited by: 1
Authors
Mohamed-Amine Baazizi
Dario Colazzo
Giorgio Ghelli
Carlo Sartiani
Affiliations
[1] Sorbonne Université, CNRS, Laboratoire d’Informatique de Paris 6
[2] PSL Research University, CNRS, LAMSADE
[3] Università di Pisa, Université Paris Dauphine
[4] Università della Basilicata, Dipartimento di Informatica
Source
The VLDB Journal | 2019 / Vol. 28
Keywords
JSON; Schema inference; Map-reduce; Spark; Big data collections
DOI
Not available
Abstract
In recent years, JSON has established itself as a very popular data format for representing massive data collections. JSON data collections are usually schemaless. While this ensures several advantages, the absence of schema information has important negative consequences as well: Data analysts and programmers cannot exploit a schema for a reliable description of the structure of the dataset, the correctness of complex queries and programs cannot be statically checked, and many schema-based optimizations are not possible. In this paper, we deal with the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language which is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about input data. We then present our contributions, which are the design of a parametric and parallelizable schema inference algorithm, its theoretical study, and its implementation based on Spark, enabling reasonable schema inference time for massive collections. Our algorithm is parametric as the analyst can specify a parameter determining the level of precision and conciseness of the inferred schema. Finally, we report on an experimental analysis showing the effectiveness of our approach in terms of execution time, conciseness of inferred schemas, and scalability.
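As a rough illustration of the map-reduce shape the abstract describes, the sketch below infers a type for each JSON value (map) and then merges types pairwise with a commutative, associative function (reduce). It is a minimal sketch under stated assumptions, not the authors' algorithm: the names `infer`, `fuse`, `fuse_fields`, `infer_schema`, the type encoding, and the two parameter values `"kind"` and `"label"` are all hypothetical, chosen only to show how one parameter can trade conciseness against precision. It runs in plain Python; with PySpark the same two functions could presumably be passed to `RDD.map` and `RDD.reduce`.

```python
from functools import reduce

# A type is a dict {class_key: payload}; several keys mean a union.
# Hypothetical parameter `equiv`:
#   "kind"  -> all records fall into one class (concise schema)
#   "label" -> records are merged only if they have the same key set (precise schema)

def infer(value, equiv):
    """Map step: infer the type of a single parsed JSON value."""
    if isinstance(value, dict):
        fields = {k: (infer(v, equiv), False) for k, v in value.items()}
        key = "Rec" if equiv == "kind" else ("Rec", frozenset(value))
        return {key: fields}
    if isinstance(value, list):
        items = reduce(fuse, (infer(v, equiv) for v in value), {})
        return {"Arr": items}
    if value is None:
        return {"Null": None}
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        return {"Bool": None}
    if isinstance(value, (int, float)):
        return {"Num": None}
    return {"Str": None}

def fuse_fields(f, g):
    """Merge two field maps; a field missing on one side becomes optional."""
    merged = {}
    for name in f.keys() | g.keys():
        if name in f and name in g:
            (t1, o1), (t2, o2) = f[name], g[name]
            merged[name] = (fuse(t1, t2), o1 or o2)
        else:
            t, _ = f[name] if name in f else g[name]
            merged[name] = (t, True)
    return merged

def fuse(a, b):
    """Reduce step: merge two types without mutating either input."""
    out = dict(a)
    for key, payload in b.items():
        if key not in out:
            out[key] = payload
        elif key == "Arr":
            out[key] = fuse(out[key], payload)
        elif key == "Rec" or isinstance(key, tuple):
            out[key] = fuse_fields(out[key], payload)
        # atomic types with the same key need no merging
    return out

def infer_schema(values, equiv):
    return reduce(fuse, (infer(v, equiv) for v in values), {})

docs = [{"a": 1, "b": "x"}, {"a": None}, {"a": 2, "b": "y"}]
concise = infer_schema(docs, "kind")   # one record type, field "b" optional
precise = infer_schema(docs, "label")  # one record type per distinct key set
```

Because `fuse` does not depend on the order in which partial results arrive, partitions of a collection can be reduced independently and then combined, which is what makes this style of inference easy to parallelize.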
Pages: 497 - 521
Page count: 24
Related papers
50 in total
  • [1] Parametric schema inference for massive JSON datasets
    Baazizi, Mohamed-Amine
    Colazzo, Dario
    Ghelli, Giorgio
    Sartiani, Carlo
    VLDB JOURNAL, 2019, 28 (04): : 497 - 521
  • [2] JSON Schema Inference Approaches
    Contos, Pavel
    Svoboda, Martin
    ADVANCES IN CONCEPTUAL MODELING, ER 2020, 2020, 12584 : 173 - 183
  • [3] Counting Types for Massive JSON Datasets
    Baazizi, Mohamed-Amine
    Colazzo, Dario
    Ghelli, Giorgio
    Sartiani, Carlo
    PROCEEDINGS OF THE 16TH INTERNATIONAL SYMPOSIUM ON DATABASE PROGRAMMING LANGUAGES (DBPL 2017), 2017,
  • [4] A Comparative Analysis of JSON Schema Inference Algorithms
    Lattak, Ivan Veinhardt
    Koupil, Pavel
    ENASE: PROCEEDINGS OF THE 17TH INTERNATIONAL CONFERENCE ON EVALUATION OF NOVEL APPROACHES TO SOFTWARE ENGINEERING, 2022, : 379 - 386
  • [5] Foundations of JSON Schema
    Pezoa, Felipe
    Reutter, Juan L.
    Suarez, Fernando
    Ugarte, Martin
    Vrgoc, Domagoj
    PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON WORLD WIDE WEB (WWW'16), 2016, : 263 - 273
  • [6] Witness Generation for JSON Schema
    Attouche, Lyes
    Baazizi, Mohamed-Amine
    Colazzo, Dario
    Ghelli, Giorgio
    Sartiani, Carlo
    Scherzinger, Stefanie
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2022, 15 (13): : 4002 - 4014
  • [7] A web service based on RESTful API and JSON Schema/JSON Meta Schema to construct knowledge graphs
    Agocs, Adam
    Le Goff, Jean-Marie
    2018 INTERNATIONAL CONFERENCE ON COMPUTER, INFORMATION AND TELECOMMUNICATION SYSTEMS (IEEE CITS 2018), 2018, : 167 - 171
  • [8] Nested Schema Mappings for Integrating JSON
    Hai, Rihan
    Quix, Christoph
    Kensche, David
    CONCEPTUAL MODELING, ER 2018, 2018, 11157 : 397 - 405
  • [9] Negation-closure for JSON Schema
    Baazizi, Mohamed-Amine
    Colazzo, Dario
    Ghelli, Giorgio
    Sartiani, Carlo
    Scherzinger, Stefanie
    THEORETICAL COMPUTER SCIENCE, 2023, 955
  • [10] JSON Schema Matching: Empirical Observations
    Waghray, Kunal
    SIGMOD'20: PROCEEDINGS OF THE 2020 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2020, : 2887 - 2889