JSON document clustering based on schema embeddings

Cited by: 1
Authors
Priya, D. Uma [1]
Thilagam, P. Santhi [1]
Affiliations
[1] Natl Inst Technol Karnataka, Dept Comp Sci & Engn, Surathkal, India
Keywords
Clustering; contextual similarity; deep autoencoders; embeddings; JSON; CLASSIFICATION
DOI
10.1177/01655515221116522
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
The growing popularity of JSON as the data storage and interchange format increases the availability of massive multi-structured data collections. Clustering JSON documents has become a significant issue in organising large data collections. Existing research uses various structural similarity measures to perform clustering. However, differently annotated JSON structures may also encode semantic relatedness, necessitating the use of both syntactic and semantic properties of heterogeneous JSON schemas. Using the SchemaEmbed model, this paper proposes an embedding-based clustering approach for grouping contextually similar JSON documents. The SchemaEmbed model is designed using the pre-trained Word2Vec model and a deep autoencoder that considers both syntactic and semantic information of JSON schemas for clustering the documents. The Word2Vec model learns the attribute embeddings, and a deep autoencoder is designed to generate context-aware schema embeddings. Finally, the context-based similar JSON documents are grouped using a clustering algorithm. The effectiveness of the proposed work is evaluated using both real and synthetic datasets. The results and findings show that the proposed approach improves clustering quality significantly, with a high NMI score of 75%. In addition, we demonstrate that clustering results obtained by contextual similarity are superior to those obtained by traditional semantic similarity models.
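The abstract reports clustering quality as a normalized mutual information (NMI) score. As a rough illustration of how such a score can be computed from ground-truth and predicted cluster labels, here is a minimal standard-library sketch. The function name `nmi`, the arithmetic-mean normalization, and the toy labels are assumptions made for illustration; the abstract does not state which NMI variant or tooling the authors used.

```python
import math
from collections import Counter


def nmi(labels_true, labels_pred):
    """Normalized mutual information between two labelings.

    Assumption: normalizes by the arithmetic mean of the two entropies;
    other normalizations (min, max, geometric mean) also exist.
    """
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("label lists must be non-empty and of equal length")

    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))

    def entropy(counts):
        # Shannon entropy (natural log) of the empirical label distribution.
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    h_true = entropy(count_true)
    h_pred = entropy(count_pred)

    # Mutual information: sum over co-occurring (true, predicted) label pairs.
    mi = sum(
        (c / n) * math.log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in count_joint.items()
    )

    denom = (h_true + h_pred) / 2
    # Both labelings put everything in one cluster: treat as perfect agreement.
    return 1.0 if denom == 0 else mi / denom


if __name__ == "__main__":
    # Hypothetical labels for six JSON documents.
    truth = ["order", "order", "user", "user", "log", "log"]
    predicted = [0, 0, 1, 1, 2, 1]
    print(f"NMI = {nmi(truth, predicted):.3f}")
```

NMI is invariant to how clusters are named, so a prediction that matches the ground truth up to relabeling scores approximately 1, while a prediction that is statistically independent of the ground truth scores 0.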
Pages: 1112-1130
Page count: 19
Related papers
  • [1] An Approach for Schema Extraction of JSON and Extended JSON Document Collections
    Frozza, Angelo Augusto
    Mello, Ronaldo dos Santos
    da Costa, Felipe de Souza
    2018 IEEE INTERNATIONAL CONFERENCE ON INFORMATION REUSE AND INTEGRATION (IRI), 2018, : 356 - 363
  • [2] Foundations of JSON Schema
    Pezoa, Felipe
    Reutter, Juan L.
    Suarez, Fernando
    Ugarte, Martin
    Vrgoc, Domagoj
    PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON WORLD WIDE WEB (WWW'16), 2016, : 263 - 273
  • [3] A web service based on RESTful API and JSON Schema/JSON Meta Schema to construct knowledge graphs
    Agocs, Adam
    Le Goff, Jean-Marie
    2018 INTERNATIONAL CONFERENCE ON COMPUTER, INFORMATION AND TELECOMMUNICATION SYSTEMS (IEEE CITS 2018), 2018, : 167 - 171
  • [4] JsonCurer: Data Quality Management for JSON Based on an Aggregated Schema
    Xiong, Kai
    Xu, Xinyi
    Fu, Siwei
    Weng, Di
    Wang, Yongheng
    Wu, Yingcai
    IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, 2024, 30 (06) : 3008 - 3021
  • [5] Leveraging Structural and Semantic Measures for JSON Document Clustering
    Priya, D. Uma
    Thilagam, P. Santhi
    JOURNAL OF UNIVERSAL COMPUTER SCIENCE, 2023, 29 (03) : 222 - 241
  • [6] JSON Schema Inference Approaches
    Contos, Pavel
    Svoboda, Martin
    ADVANCES IN CONCEPTUAL MODELING, ER 2020, 2020, 12584 : 173 - 183
  • [7] Witness Generation for JSON Schema
    Attouche, Lyes
    Baazizi, Mohamed-Amine
    Colazzo, Dario
    Ghelli, Giorgio
    Sartiani, Carlo
    Scherzinger, Stefanie
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2022, 15 (13) : 4002 - 4014
  • [8] Nested Schema Mappings for Integrating JSON
    Hai, Rihan
    Quix, Christoph
    Kensche, David
    CONCEPTUAL MODELING, ER 2018, 2018, 11157 : 397 - 405
  • [9] Negation-closure for JSON Schema
    Baazizi, Mohamed-Amine
    Colazzo, Dario
    Ghelli, Giorgio
    Sartiani, Carlo
    Scherzinger, Stefanie
    THEORETICAL COMPUTER SCIENCE, 2023, 955
  • [10] JSON Schema Matching: Empirical Observations
    Waghray, Kunal
    SIGMOD'20: PROCEEDINGS OF THE 2020 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2020, : 2887 - 2889