Parametric schema inference for massive JSON datasets

Cited by: 1
Authors
Mohamed-Amine Baazizi
Dario Colazzo
Giorgio Ghelli
Carlo Sartiani
Affiliations
[1] Sorbonne Université, CNRS, Laboratoire d’Informatique de Paris 6
[2] PSL Research University, CNRS, LAMSADE
[3] Università di Pisa, Université Paris Dauphine
[4] Università della Basilicata, Dipartimento di Informatica
Source
The VLDB Journal | 2019 / Volume 28
Keywords
JSON; Schema inference; Map-reduce; Spark; Big data collections
DOI
Not available
Abstract
In recent years, JSON has established itself as a very popular data format for representing massive data collections. JSON data collections are usually schemaless. While this has several advantages, the absence of schema information also has important negative consequences: data analysts and programmers cannot rely on a schema for a dependable description of the structure of the dataset, the correctness of complex queries and programs cannot be statically checked, and many schema-based optimizations are not possible. In this paper, we deal with the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language which is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about input data. We then present our contributions, which are the design of a parametric and parallelizable schema inference algorithm, its theoretical study, and its implementation based on Spark, enabling reasonable schema inference time for massive collections. Our algorithm is parametric in that the analyst can specify a parameter determining the level of precision and conciseness of the inferred schema. Finally, we report on an experimental analysis showing the effectiveness of our approach in terms of execution time, conciseness of inferred schemas, and scalability.
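The map-reduce shape described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm or type language: the names `infer`, `fuse`, and `infer_schema`, the type encoding, and the two parameter values `"kind"` and `"label"` are assumptions made for illustration. Each value is mapped to a type, and types are then combined pairwise with a merge function, which is the pattern a Spark map followed by a reduce would parallelize. The parameter decides when two record types are merged into one: always (`"kind"`, more concise) or only when they have the same field names (`"label"`, more precise).

```python
from functools import reduce

# A type is a frozenset of atoms (a union). An atom is a base-type name,
# ("Arr", element_type), or ("Rec", frozenset of (label, type, optional)).

def infer(value, equiv):
    """Map step: infer a type for a single parsed JSON value."""
    if value is None:
        return frozenset({"Null"})
    if isinstance(value, bool):  # must be tested before int
        return frozenset({"Bool"})
    if isinstance(value, (int, float)):
        return frozenset({"Num"})
    if isinstance(value, str):
        return frozenset({"Str"})
    if isinstance(value, list):
        elems = (infer(v, equiv) for v in value)
        elem_type = reduce(lambda a, b: fuse(a, b, equiv), elems, frozenset())
        return frozenset({("Arr", elem_type)})
    fields = frozenset((k, infer(v, equiv), False) for k, v in value.items())
    return frozenset({("Rec", fields)})

def _key(atom, equiv):
    """Atoms with the same key are merged; this is where the parameter acts."""
    if isinstance(atom, str):
        return atom
    if atom[0] == "Arr" or equiv == "kind":
        return atom[0]
    return ("Rec", frozenset(label for label, _, _ in atom[1]))

def _merge(a, b, equiv):
    """Merge two atoms that share a key."""
    if isinstance(a, str):
        return a
    if a[0] == "Arr":
        return ("Arr", fuse(a[1], b[1], equiv))
    fa = {label: (t, opt) for label, t, opt in a[1]}
    fb = {label: (t, opt) for label, t, opt in b[1]}
    fields = set()
    for label in fa.keys() | fb.keys():
        if label in fa and label in fb:
            t = fuse(fa[label][0], fb[label][0], equiv)
            fields.add((label, t, fa[label][1] or fb[label][1]))
        else:
            # Present on one side only, so the field becomes optional.
            t, _ = fa[label] if label in fa else fb[label]
            fields.add((label, t, True))
    return ("Rec", frozenset(fields))

def fuse(t1, t2, equiv):
    """Reduce step: combine two types into one."""
    groups = {}
    for atom in list(t1) + list(t2):
        k = _key(atom, equiv)
        groups[k] = _merge(groups[k], atom, equiv) if k in groups else atom
    return frozenset(groups.values())

def infer_schema(docs, equiv):
    """Infer a type for a collection of parsed JSON documents."""
    return reduce(lambda a, b: fuse(a, b, equiv),
                  (infer(d, equiv) for d in docs), frozenset())

docs = [{"a": 1, "b": "x"}, {"a": None}, {"a": 2, "b": "y"}]
concise = infer_schema(docs, "kind")    # one record type, field "b" optional
precise = infer_schema(docs, "label")   # two record types, one per label set
```

In this sketch the result does not depend on the order in which the documents are fused, which is the property that makes a pairwise reduce safe to distribute. The trade-off is visible on the three sample documents: `"kind"` yields a single record type with an optional field, while `"label"` keeps the two record shapes separate.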
Pages: 497 - 521
Page count: 24
Related papers
50 in total
  • [41] Massive datasets
    Kettenring, Jon R.
    WILEY INTERDISCIPLINARY REVIEWS-COMPUTATIONAL STATISTICS, 2009, 1 (01) : 25 - 32
  • [42] JS4Geo: a canonical JSON Schema for geographic data suitable to NoSQL databases
    Frozza, Angelo A.
    Mello, Ronaldo dos S.
    GEOINFORMATICA, 2020, 24 (04) : 987 - 1019
  • [43] ReCG: Bottom-Up JSON Schema Discovery Using a Repetitive Cluster-and-Generalize Framework
    Yun, Joohyung
    Tak, Byungchul
    Han, Wook-Shin
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2024, 17 (11) : 3538 - 3550
  • [44] Using JSON Schema to Define a Systems Modeling Vocabulary: The Tradespace Analysis Tool for Constellations (TAT-C)
    Grogan, Paul T.
    Tapia, Josue I.
    PROCEEDINGS OF THE 2023 CONFERENCE ON SYSTEMS ENGINEERING RESEARCH, CSER 2023, 2024, : 47 - 65
  • [45] Analyzing embedded semantic with JSON-LD and Microdata for educational resources in large scale web datasets
    Navarrete, Rosa
    Recalde, Lorena
    Montenegro, Carlos
    Lujan-Mora, Sergio
    2019 6TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE AND COMPUTATIONAL INTELLIGENCE (CSCI 2019), 2019, : 1133 - 1138
  • [46] INFERENCE AND SCHEMA - AN ETHNOGRAPHIC VIEW
    AGAR, MH
    HUMAN STUDIES, 1983, 6 (01) : 53 - 66
  • [47] Mining of massive datasets
    Rajaraman, Anand
    Ullman, Jeffrey David
    Mining of Massive Datasets, 2011, 9781107015357 : 1 - 315
  • [48] Mining of Massive Datasets
    Richter, Lothar
    BIOMETRICS, 2018, 74 (04) : 1520 - 1521
  • [49] DOE and Massive Datasets
    Unknown
    JOURNAL OF NUCLEAR MEDICINE, 2012, 53 (06) : 26N - 26N
  • [50] Workshop on Massive Datasets
    Wren, Christopher R.
    Ivanov, Yuri A.
    ICMI'07: PROCEEDINGS OF THE NINTH INTERNATIONAL CONFERENCE ON MULTIMODAL INTERFACES, 2007, : 385 - 385