Schema profiling of document-oriented databases

Cited by: 42
Authors
Gallinucci, Enrico [1, 2]
Golfarelli, Matteo [1, 2]
Rizzi, Stefano [1, 2]
Affiliations
[1] Univ Bologna, DISI, Viale Risorgimento 2, I-40136 Bologna, Italy
[2] CINI, Via Solaria 113, I-00198 Rome, Italy
Funding
European Union Horizon 2020
Keywords
NoSQL; Document-oriented databases; Schema discovery; Decision trees
DOI
10.1016/j.is.2018.02.007
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In document-oriented databases, schema is a soft concept and the documents in a collection can be stored using different local schemata. This gives designers and implementers augmented flexibility; however, it requires an extra effort to understand the rules that drove the use of alternative schemata when sets of documents with different (and possibly conflicting) schemata are to be analyzed or integrated. In this paper we propose a technique, called schema profiling, to explain the schema variants within a collection in document-oriented databases by capturing the hidden rules explaining the use of these variants. We express these rules in the form of a decision tree (schema profile). Consistently with the requirements we elicited from real users, we aim at creating explicative, precise, and concise schema profiles. The algorithm we adopt to this end is inspired by the well-known C4.5 classification algorithm and builds on two original features: the coupling of value-based and schema-based conditions within schema profiles, and the introduction of a novel measure of entropy to assess the quality of a schema profile. A set of experimental tests made on both synthetic and real datasets demonstrates the effectiveness and efficiency of our approach. © 2018 Elsevier Ltd. All rights reserved.
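As a rough illustration of the idea in the abstract, the sketch below extracts the local schema of each document, measures how mixed the schema variants are with ordinary Shannon entropy, and scores a candidate split condition by information gain, in the spirit of C4.5. It is not the paper's algorithm: the novel entropy measure mentioned in the abstract is not reproduced here, and the function names (`local_schema`, `entropy`, `information_gain`) and the sample documents are hypothetical.

```python
import math
from collections import Counter

def local_schema(doc, prefix=""):
    """Return the set of field paths of a (possibly nested) document."""
    paths = set()
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            paths |= local_schema(value, path + ".")
        else:
            paths.add(path)
    return frozenset(paths)

def entropy(schemas):
    """Plain Shannon entropy (in bits) of a list of schema variants.

    Assumption: this stands in for the paper's own entropy measure,
    which the abstract does not define.
    """
    total = len(schemas)
    counts = Counter(schemas)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(docs, condition):
    """Entropy reduction obtained by splitting docs on a boolean condition."""
    schemas = [local_schema(d) for d in docs]
    groups = {True: [], False: []}
    for doc, schema in zip(docs, schemas):
        groups[bool(condition(doc))].append(schema)
    remainder = sum(
        len(g) / len(docs) * entropy(g) for g in groups.values() if g
    )
    return entropy(schemas) - remainder

# Hypothetical collection with two schema variants.
docs = [
    {"type": "run", "distance": 5, "pace": 6},
    {"type": "run", "distance": 10, "pace": 5},
    {"type": "swim", "laps": 20},
    {"type": "swim", "laps": 30},
]

# Value-based condition: tests the value of a field.
gain_value = information_gain(docs, lambda d: d["type"] == "run")
# Schema-based condition: tests the presence of a field.
gain_schema = information_gain(docs, lambda d: "laps" in d)
# A less informative value-based condition.
gain_weak = information_gain(docs, lambda d: d.get("distance", 0) > 7)
```

In this toy collection both the value-based and the schema-based condition separate the two variants completely, so each yields the full one bit of gain, while the threshold on `distance` leaves one branch mixed and scores lower. A tree builder would pick the highest-scoring condition at each node and recurse on the resulting groups.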
Pages: 13-25
Page count: 13
Related papers
50 in total
  • [31] UML4NOSQL: A NOVEL APPROACH FOR MODELING NOSQL DOCUMENT-ORIENTED DATABASES BASED ON UML
    Maicha, Mohammed ElHabib
    Ouinten, Youcef
    Ziani, Benameur
    [J]. COMPUTING AND INFORMATICS, 2022, 41 (03) : 813 - 833
  • [32] A data replication strategy for document-oriented NoSQL systems
    Tabet, Khaoula
    Mokadem, Riad
    Laouar, Mohamed Ridda
    [J]. INTERNATIONAL JOURNAL OF GRID AND UTILITY COMPUTING, 2019, 10 (01) : 53 - 62
  • [33] Data Modeling for Analytical Queries on Document-Oriented DBMS
    Soransso, R. A. S. N.
    Cavalcanti, M. C.
    [J]. 33RD ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING, 2018, : 541 - 548
  • [34] Document-oriented development of content-intensive applications
    Sierra, JL
    Fernández-Manjón, B
    Fernández-Valmayor, A
    Navarro, A
    [J]. INTERNATIONAL JOURNAL OF SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING, 2005, 15 (06) : 975 - 993
  • [35] A document-oriented approach to the development of knowledge based systems
    Sierra, JL
    Fernández-Manjón, B
    Fernández-Valmayor, A
    Navarro, A
    [J]. CURRENT TOPICS IN ARTIFICIAL INTELLIGENCE, 2004, 3040 : 16 - 25
  • [36] Document-Oriented Data Warehouses: Complex Hierarchies and Summarizability
    Chevalier, Max
    El Malki, Mohammed
    Kopliku, Arlind
    Teste, Olivier
    Tournier, Ronan
    [J]. ADVANCES IN UBIQUITOUS NETWORKING 2, 2017, 397 : 671 - 683
  • [37] Towards a Cleaner Document-Oriented Multilingual Crawled Corpus
    Abadji, Julien
    Suarez, Pedro Ortiz
    Romary, Laurent
    Sagot, Benoit
    [J]. LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 4344 - 4355
  • [39] Document-Oriented Middleware: The Way to High-Quality Software
    Kral, Jaroslav
    Pitner, Tomas
    Zemlicka, Michal
    [J]. COMPUTATIONAL SCIENCE AND ITS APPLICATIONS - ICCSA 2017, PT V, 2017, 10408 : 607 - 619
  • [40] AStar: A modeling language for document-oriented geospatial data warehouses
    Ferro, Marcio
    Silva, Edson
    Fidalgo, Robson
    [J]. DATA & KNOWLEDGE ENGINEERING, 2023, 145