Parametric schema inference for massive JSON datasets

Cited by: 1
Authors
Mohamed-Amine Baazizi
Dario Colazzo
Giorgio Ghelli
Carlo Sartiani
Affiliations
[1] Sorbonne Université, CNRS, Laboratoire d’Informatique de Paris 6
[2] PSL Research University, CNRS, LAMSADE
[3] Università di Pisa, Université Paris Dauphine
[4] Università della Basilicata, Dipartimento di Informatica
Source
The VLDB Journal | 2019 / Vol. 28
Keywords
JSON; Schema inference; Map-reduce; Spark; Big data collections
DOI
Not available
Abstract
In recent years, JSON has established itself as a very popular data format for representing massive data collections. JSON data collections are usually schemaless. While this brings several advantages, the absence of schema information also has important negative consequences: data analysts and programmers cannot rely on a schema for a description of the structure of the dataset, the correctness of complex queries and programs cannot be statically checked, and many schema-based optimizations are not possible. In this paper, we deal with the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language which is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about input data. We then present our contributions, which are the design of a parametric and parallelizable schema inference algorithm, its theoretical study, and its implementation based on Spark, enabling reasonable schema inference time for massive collections. Our algorithm is parametric in that the analyst can specify a parameter determining the level of precision and conciseness of the inferred schema. Finally, we report on an experimental analysis showing the effectiveness of our approach in terms of execution time, conciseness of inferred schemas, and scalability.
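As a rough illustration of what a parametric, map-reduce style inference could look like, the following is a minimal Python sketch and not the authors' implementation. The type encoding, the function names (`infer`, `fuse`, `infer_schema`), and the two parameter values `"kind"` and `"label"` are assumptions made for this example only: `"kind"` fuses all record types into one (more concise), while `"label"` fuses only records that share the same set of field names (more precise).

```python
from functools import reduce

# Hypothetical encoding (an assumption, not the paper's type language):
#   a schema is a frozenset of types, read as a union;
#   basic types are the strings "Null", "Bool", "Num", "Str";
#   arrays are ("Arr", element_schema);
#   records are ("Rec", ((name, field_schema, is_optional), ...)).

def infer(value, equiv):
    """Map step: the schema of one parsed JSON value."""
    if value is None:
        t = "Null"
    elif isinstance(value, bool):  # bool must be tested before int
        t = "Bool"
    elif isinstance(value, (int, float)):
        t = "Num"
    elif isinstance(value, str):
        t = "Str"
    elif isinstance(value, list):
        elems = (infer(v, equiv) for v in value)
        t = ("Arr", reduce(lambda a, b: fuse(a, b, equiv), elems, frozenset()))
    else:  # dict, i.e. a JSON object
        fields = sorted(value.items(), key=lambda kv: kv[0])
        t = ("Rec", tuple((k, infer(v, equiv), False) for k, v in fields))
    return frozenset([t])

def key(t, equiv):
    """Types with the same key are merged; `equiv` is the precision parameter."""
    if isinstance(t, str):
        return t
    if t[0] == "Arr":
        return "Arr"
    if equiv == "kind":
        return "Rec"  # all records collapse into one record type
    return ("Rec", frozenset(name for name, _, _ in t[1]))  # same labels only

def merge(t1, t2, equiv):
    """Merge two types that share a key."""
    if isinstance(t1, str):
        return t1
    if t1[0] == "Arr":
        return ("Arr", fuse(t1[1], t2[1], equiv))
    f1 = {n: (s, o) for n, s, o in t1[1]}
    f2 = {n: (s, o) for n, s, o in t2[1]}
    fields = []
    for n in sorted(f1.keys() | f2.keys()):
        if n in f1 and n in f2:
            fields.append((n, fuse(f1[n][0], f2[n][0], equiv), f1[n][1] or f2[n][1]))
        else:  # a field missing on one side becomes optional
            s, _ = f1[n] if n in f1 else f2[n]
            fields.append((n, s, True))
    return ("Rec", tuple(fields))

def fuse(s1, s2, equiv):
    """Reduce step: combine two schemas, merging types with equal keys."""
    groups = {}
    for t in s1 | s2:
        k = key(t, equiv)
        groups[k] = merge(groups[k], t, equiv) if k in groups else t
    return frozenset(groups.values())

def infer_schema(values, equiv="kind"):
    """Infer one schema for a collection of parsed JSON values."""
    schemas = (infer(v, equiv) for v in values)
    return reduce(lambda a, b: fuse(a, b, equiv), schemas, frozenset())
```

For the two documents `{"a": 1}` and `{"a": "x", "b": true}`, this sketch yields a single record type with an optional field `b` under `"kind"`, and a union of two distinct record types under `"label"`. Because the reduce step here combines schemas pairwise and does not depend on the order of the inputs, the same two functions could in principle be applied per partition in a distributed engine, although nothing in this sketch is specific to Spark.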
Pages: 497 - 521
Page count: 24
Related papers
50 in total
  • [21] JSON: Data model, Query languages and Schema specification
    Bourhis, Pierre
    Reutter, Juan L.
    Suarez, Fernando
    Vrgoc, Domagoj
    PODS'17: PROCEEDINGS OF THE 36TH ACM SIGMOD-SIGACT-SIGAI SYMPOSIUM ON PRINCIPLES OF DATABASE SYSTEMS, 2017, : 123 - 135
  • [22] Streaming CityJSON datasets
    Ledoux, Hugo
    Stavropoulou, Gina
    Dukai, Balazs
    19TH 3D GEOINFO CONFERENCE 2024, VOL. 48-4, 2024, : 57 - 63
  • [23] Knowledge Acquisition System based on JSON Schema for Electrophysiological Actuation
    da Costa, Nuno M. C.
    Araujo, Tiago
    Nunes, Neuza
    Gamboa, Hugo
    E-BUSINESS AND TELECOMMUNICATIONS, ICETE 2012, 2014, 455 : 284 - 302
  • [24] Schema-Based JSON Data Stores in Relational Databases
    Irshad, Lubna
    Yan, Li
    Ma, Zongmin
    JOURNAL OF DATABASE MANAGEMENT, 2019, 30 (03) : 38 - 70
  • [25] HAJPAQUE: Hardware Accelerator for JSON Parsing, Querying and Schema Validation
    Agarwal, Samiksha
    Sarangi, Smruti R.
    2022 IEEE COMPUTER SOCIETY ANNUAL SYMPOSIUM ON VLSI (ISVLSI 2022), 2022, : 1 - 7
  • [26] pyJSON Schema Loader and JSON Editor: A tool for file-based metadata management
    Plathe, Nick
    Becker, Markus M.
    Franke, Steffen
    SOFTWAREX, 2024, 28
  • [27] JSON Data Management - Supporting Schema-less Development in RDBMS
    Liu, Zhen Hua
    Hammerschmidt, Beda
    McMahon, Doug
    SIGMOD'14: PROCEEDINGS OF THE 2014 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2014, : 1247 - 1258
  • [28] An Empirical Study on the "Usage of Not" in Real-World JSON Schema Documents
    Baazizi, Mohamed-Amine
    Colazzo, Dario
    Ghelli, Giorgio
    Sartiani, Carlo
    Scherzinger, Stefanie
    CONCEPTUAL MODELING, ER 2021, 2021, 13011 : 102 - 112
  • [29] Temporal JSON schema versioning in the TJSchema framework
    1600, Digital Information Research Foundation, 11 Ramanujam Street, T.Nagar, Chennai, 600017, India (15)
  • [30] Translating JSON Data into Relational Data Using Schema-oblivious Approaches
    Bahta, Rahwa
    Atay, Mustafa
    PROCEEDINGS OF THE 2019 ANNUAL ACM SOUTHEAST CONFERENCE (ACMSE 2019), 2019, : 233 - 236