JSON document clustering based on schema embeddings

Cited by: 1
Authors
Priya, D. Uma [1]
Thilagam, P. Santhi [1]
Affiliations
[1] Natl Inst Technol Karnataka, Dept Comp Sci & Engn, Surathkal, India
Keywords
Clustering; contextual similarity; deep autoencoders; embeddings; JSON; CLASSIFICATION
DOI
10.1177/01655515221116522
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The growing popularity of JSON as the data storage and interchange format increases the availability of massive multi-structured data collections. Clustering JSON documents has become a significant issue in organising large data collections. Existing research uses various structural similarity measures to perform clustering. However, differently annotated JSON structures may also encode semantic relatedness, necessitating the use of both syntactic and semantic properties of heterogeneous JSON schemas. Using the SchemaEmbed model, this paper proposes an embedding-based clustering approach for grouping contextually similar JSON documents. The SchemaEmbed model is designed using the pre-trained Word2Vec model and a deep autoencoder that considers both syntactic and semantic information of JSON schemas for clustering the documents. The Word2Vec model learns the attribute embeddings, and a deep autoencoder is designed to generate context-aware schema embeddings. Finally, the context-based similar JSON documents are grouped using a clustering algorithm. The effectiveness of the proposed work is evaluated using both real and synthetic datasets. The results and findings show that the proposed approach improves clustering quality significantly, with a high NMI score of 75%. In addition, we demonstrate that clustering results obtained by contextual similarity are superior to those obtained by traditional semantic similarity models.
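As a rough illustration of two ideas the abstract mentions, the sketch below collects the attribute paths that make up a JSON document's structural schema and computes a normalised mutual information (NMI) score between two cluster labelings. It is not the SchemaEmbed model: the function names `attribute_paths` and `nmi` are hypothetical, the Word2Vec and autoencoder stages are omitted, and normalising by the arithmetic mean of the two entropies is an assumption, since the abstract does not say which NMI variant was used.

```python
import json
import math
from collections import Counter


def attribute_paths(value, prefix=""):
    """Collect the dotted attribute paths of a parsed JSON value.

    Hypothetical helper: a simple stand-in for "the schema" of a document.
    """
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= attribute_paths(child, path)
    elif isinstance(value, list):
        # Array items share the path of the array itself.
        for item in value:
            paths |= attribute_paths(item, prefix)
    return paths


def entropy(counts, n):
    """Shannon entropy (natural log) of a labeling given its label counts."""
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def nmi(labels_true, labels_pred):
    """NMI between two labelings, normalised by the mean entropy (assumed variant)."""
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("labelings must be non-empty and of equal length")
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in joint.items()
    )
    mean_entropy = (entropy(count_true, n) + entropy(count_pred, n)) / 2
    if mean_entropy == 0:
        # Both labelings put everything in one cluster; treated as full agreement (assumption).
        return 1.0
    return mutual_info / mean_entropy


if __name__ == "__main__":
    doc = json.loads('{"title": "x", "author": {"name": "y"}, "tags": [{"id": 1}]}')
    print(sorted(attribute_paths(doc)))
    print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
```

Under this definition a score of 1 means the predicted clusters match the reference grouping up to a renaming of labels, and a score of 0 means the two labelings are statistically independent.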
Pages: 1112-1130
Number of pages: 19
Related papers
50 in total
  • [11] Reducing Ambiguity in Json Schema Discovery
    Spoth, William
    Kennedy, Oliver
    Lu, Ying
    Hammerschmidt, Beda
    Liu, Zhen Hua
    SIGMOD '21: PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2021: 1732 - 1744
  • [12] JSONDISCOVERER: Visualizing the schema lurking behind JSON documents
    Canovas Izquierdo, Javier Luis
    Cabot, Jordi
    KNOWLEDGE-BASED SYSTEMS, 2016, 103 : 52 - 55
  • [13] Knowledge Acquisition System based on JSON Schema for Electrophysiological Actuation
    da Costa, Nuno M. C.
    Araujo, Tiago
    Nunes, Neuza
    Gamboa, Hugo
    E-BUSINESS AND TELECOMMUNICATIONS, ICETE 2012, 2014, 455 : 284 - 302
  • [14] Schema-Based JSON Data Stores in Relational Databases
    Irshad, Lubna
    Yan, Li
    Ma, Zongmin
    JOURNAL OF DATABASE MANAGEMENT, 2019, 30 (03) : 38 - 70
  • [15] pyJSON Schema Loader and JSON Editor: A tool for file-based metadata management
    Plathe, Nick
    Becker, Markus M.
    Franke, Steffen
    SOFTWAREX, 2024, 28
  • [16] Definition of REST web services with JSON schema
    Barbaglia, Guido
    Murzilli, Simone
    Cudini, Stefano
    SOFTWARE-PRACTICE & EXPERIENCE, 2017, 47 (06): 907 - 920
  • [17] Research on the Translation from XSD to JSON Schema
    Guo, Shijiao
    Xia, Hongxia
    Xiang, Guangli
    2017 IEEE 9TH INTERNATIONAL CONFERENCE ON COMMUNICATION SOFTWARE AND NETWORKS (ICCSN), 2017: 1393 - 1396
  • [18] A Comparative Analysis of JSON Schema Inference Algorithms
    Lattak, Ivan Veinhardt
    Koupil, Pavel
    ENASE: PROCEEDINGS OF THE 17TH INTERNATIONAL CONFERENCE ON EVALUATION OF NOVEL APPROACHES TO SOFTWARE ENGINEERING, 2022: 379 - 386
  • [19] Parametric schema inference for massive JSON datasets
    Baazizi, Mohamed-Amine
    Colazzo, Dario
    Ghelli, Giorgio
    Sartiani, Carlo
    VLDB JOURNAL, 2019, 28 (04): 497 - 521
  • [20] Validation of Modern JSON Schema: Formalization and Complexity
    Attouche, Lyes
    Baazizi, Mohamed-Amine
    Colazzo, Dario
    Ghelli, Giorgio
    Sartiani, Carlo
    Scherzinger, Stefanie
    PROCEEDINGS OF THE ACM ON PROGRAMMING LANGUAGES-PACMPL, 2024, 8 (POPL)