Parametric schema inference for massive JSON datasets

Cited by: 1
Authors
Mohamed-Amine Baazizi
Dario Colazzo
Giorgio Ghelli
Carlo Sartiani
Affiliations
[1] Sorbonne Université, CNRS, Laboratoire d’Informatique de Paris 6
[2] PSL Research University, CNRS, LAMSADE
[3] Università di Pisa, Université Paris Dauphine
[4] Università della Basilicata, Dipartimento di Informatica
Source
The VLDB Journal | 2019 / Vol. 28
Keywords
JSON; Schema inference; Map-reduce; Spark; Big data collections
DOI
Not available
Abstract
In recent years, JSON has established itself as a very popular data format for representing massive data collections. JSON data collections are usually schemaless. While this brings several advantages, the absence of schema information also has important negative consequences: data analysts and programmers cannot rely on a schema for a reliable description of the dataset's structure, the correctness of complex queries and programs cannot be statically checked, and many schema-based optimizations are not possible. In this paper, we deal with the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language that is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about the input data. We then present our contributions: the design of a parametric and parallelizable schema inference algorithm, its theoretical study, and its implementation based on Spark, which enables reasonable schema inference time for massive collections. Our algorithm is parametric in that the analyst can specify a parameter determining the level of precision and conciseness of the inferred schema. Finally, we report on an experimental analysis showing the effectiveness of our approach in terms of execution time, conciseness of inferred schemas, and scalability.
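To make the idea of parametric, parallelizable inference concrete, the following is a minimal Python sketch and not the authors' algorithm or type language. It assumes a two-phase design: a map step that assigns each JSON value a structural type, and a reduce step that fuses types pairwise. The function names (`infer`, `fuse`, `fuse_records`, `infer_schema`), the dictionary encoding of types, and the two parameter settings `"K"` (merge all records into one, with optional fields) and `"L"` (merge only records with identical field names) are all illustrative assumptions.

```python
import json
from functools import reduce

# Hypothetical encoding: a type is a dict {class_key: body}, read as a union.
# Basic types have body None; "Arr" has the fused element type as body;
# record classes have a body {field: (type, is_optional)}.

def rec_key(fields, equiv):
    # Assumed parameter: "K" puts all records in one class (concise schema);
    # "L" keeps one class per distinct set of field names (precise schema).
    return "Rec" if equiv == "K" else ("Rec", frozenset(fields))

def infer(value, equiv="K"):
    """Map step: assign a structural type to a single JSON value."""
    if value is None:
        return {"Null": None}
    if isinstance(value, bool):  # must be tested before int
        return {"Bool": None}
    if isinstance(value, (int, float)):
        return {"Num": None}
    if isinstance(value, str):
        return {"Str": None}
    if isinstance(value, list):
        return {"Arr": reduce(fuse, (infer(v, equiv) for v in value), {})}
    fields = {k: (infer(v, equiv), False) for k, v in value.items()}
    return {rec_key(fields, equiv): fields}

def fuse_records(r1, r2):
    """Merge two record bodies; a field missing on one side becomes optional."""
    merged = {}
    for name in r1.keys() | r2.keys():
        if name in r1 and name in r2:
            (t1, o1), (t2, o2) = r1[name], r2[name]
            merged[name] = (fuse(t1, t2), o1 or o2)
        else:
            t, _ = r1[name] if name in r1 else r2[name]
            merged[name] = (t, True)
    return merged

def fuse(t1, t2):
    """Reduce step: union two types, merging members of the same class."""
    out = dict(t1)
    for key, body in t2.items():
        if key not in out:
            out[key] = body
        elif key == "Arr":
            out[key] = fuse(out[key], body)
        elif body is not None:  # a record class present on both sides
            out[key] = fuse_records(out[key], body)
    return out

def infer_schema(lines, equiv="K"):
    """Infer one schema for a collection of JSON documents (one per string)."""
    return reduce(fuse, (infer(json.loads(s), equiv) for s in lines), {})

docs = ['{"a": 1, "b": "x"}', '{"a": "y"}', '{"a": null, "c": [1, "z"]}']
concise = infer_schema(docs, "K")  # a single record type with optional fields
precise = infer_schema(docs, "L")  # one record type per distinct field set
```

Because `fuse` in this sketch is associative and commutative, the same function could in principle serve as the reduce operator of a distributed framework such as Spark, which is what makes this style of inference parallelizable.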
Pages: 497-521
Number of pages: 24
Related papers
50 in total
  • [31] LiteIndex: Memory-Efficient Schema-Agnostic Indexing for JSON documents in SQLite
    Shang, Siqi
    Wu, Qihong
    Wang, Tianyu
    Shao, Zili
    2021 26TH ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE (ASP-DAC), 2021, : 435 - 440
  • [32] A JSON Token-Based Authentication and Access Management Schema for Cloud SaaS Applications
    Ethelbert, Obinna
    Moghaddam, Faraz Fatemi
    Wieder, Philipp
    Yahyapour, Ramin
    2017 IEEE 5TH INTERNATIONAL CONFERENCE ON FUTURE INTERNET OF THINGS AND CLOUD (FICLOUD 2017), 2017, : 47 - 53
  • [33] Providing Research Graph Data in JSON-LD Using Schema.org
    Wang, Jingbo
    Aryani, Amir
    Wyborn, Lesley
    Evans, Ben
    WWW'17 COMPANION: PROCEEDINGS OF THE 26TH INTERNATIONAL CONFERENCE ON WORLD WIDE WEB, 2017, : 1213 - 1218
  • [34] Challenges in Checking JSON Schema Containment over Evolving Real-World Schemas
    Fruth, Michael
    Baazizi, Mohamed-Amine
    Colazzo, Dario
    Ghelli, Giorgio
    Sartiani, Carlo
    Scherzinger, Stefanie
    ADVANCES IN CONCEPTUAL MODELING, ER 2020, 2020, 12584 : 220 - 230
  • [35] Model checking for parametric single-index models with massive datasets
    Yang, Xin
    Yan, Qijing
    Wu, Mixia
    JOURNAL OF STATISTICAL PLANNING AND INFERENCE, 2023, 227 : 129 - 145
  • [36] Translating JSON Schema logics into OWL axioms for unified data validation on a digital manufacturing platform
    Cheong, Hyunmin
    7TH INTERNATIONAL CONFERENCE ON CHANGEABLE, AGILE, RECONFIGURABLE AND VIRTUAL PRODUCTION (CARV2018), 2019, 28 : 183 - 188
  • [37] LEI2JSON: Schema-based validation and conversion of livestock event information
    Habib, Mahir
    Kabir, Muhammad Ashad
    Zheng, Lihong
    SOFTWAREX, 2024, 26
  • [38] Meta-Kriging: Scalable Bayesian Modeling and Inference for Massive Spatial Datasets
    Guhaniyogi, Rajarshi
    Banerjee, Sudipto
    TECHNOMETRICS, 2018, 60 (04) : 430 - 444
  • [39] Bayesian Inference in Common Microeconometric Models With Massive Datasets by Double Marginalized Subsampling
    Qian, Hang
    JOURNAL OF BUSINESS & ECONOMIC STATISTICS, 2022, 40 (04) : 1484 - 1497
  • [40] HAJPAQUE: Hardware Accelerator for JSON Parsing, Querying and Schema Validation
    Agarwal, Samiksha
    Sarangi, Smruti R.
    Proceedings of IEEE Computer Society Annual Symposium on VLSI, ISVLSI, 2022, 2022-July : 1 - 7