Large Database Schema Matching using Data Mining Techniques

Cited by: 2
Authors
Reis, Debora G. [1]
Ladeira, Marcelo [1]
Holanda, Maristela [1]
Victorino, Marcio C. [2]
Affiliations
[1] Univ Brasilia UnB, Dept Comp Sci, Brasilia, DF, Brazil
[2] Univ Brasilia UnB, Fac Informat Sci, Brasilia, DF, Brazil
Keywords
schema matching; data mining; schema deduplication; cluster; data integration
DOI
10.1109/ICDMW.2018.00083
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
With the expanding diversity of database technologies and database sizes, it is becoming increasingly hard to identify similar relational databases among many large databases stored in different Database Management Systems (DBMS). Therefore, we propose using data mining techniques to automatically identify similar structures of relational databases by comparing their metadata, which is composed of physical details of the databases. The amount of metadata is proportional to the size of the schema structure, and the number of possible pairwise comparisons is quadratic in the number of schemas analyzed. Looking for the most efficient technique, we propose calculating schema similarity by evaluating the distance from every schema to a single schema, which serves as a starting point. Schemas at close distances are more similar than schemas at larger distances. We compare this proposal against two other approaches. The first compares every schema against all other schemas, omitting the inverse of each comparison. The second compares schemas within groups of schemas of similar size. To validate our proposal, an experiment is performed with 354 real schemas ranging in size from 2 to 20 thousand metadata items, totaling more than 26 thousand tables and 238 thousand columns. These schemas came from 5 different DBMS. The extracted metadata is transformed and formatted for comparing pairs of schemas. Textual features are compared using Cosine Distance and numerical features are compared using Euclidean Distance. Then, hierarchical clustering is used to facilitate the visualization of the schemas that most closely resemble one another. The results showed that our approach was the most efficient because it compared all schemas and identified the most similar schemas by structure in less than 2 minutes.
The extracted metadata was used to create the first version of the metadata repository and an initial version of a data catalog, which contributed to the knowledge of existing data. Using this procedure, duplicated schemas were discovered and then discontinued, resulting in cost savings of 10% while freeing up infrastructure resources. This solution is flexible: it supports a variety of schema sizes and DBMS.
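
The single-reference comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The schema representation (a `names` token list and a `counts` numeric vector), the equal weighting of the two distances, the assumption that numerical features are already scaled to [0, 1], and all function names are hypothetical choices made for illustration only. The sketch also omits the hierarchical clustering step.

```python
import math
from collections import Counter


def cosine_distance(tokens_a, tokens_b):
    """Return 1 - cosine similarity of two bag-of-words token lists."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty token list is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(xs, ys):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))


def schema_distance(a, b, w_text=0.5):
    """Combine textual and numerical distances.

    Assumption: the weighted sum and the default weight are illustrative;
    the abstract does not say how the two distances are combined.
    """
    text = cosine_distance(a["names"], b["names"])
    numeric = euclidean_distance(a["counts"], b["counts"])
    return w_text * text + (1.0 - w_text) * numeric


def rank_against_reference(schemas, ref_name):
    """Compute the distance of every schema to one reference schema.

    Only n - 1 comparisons are needed. Schemas whose distances to the
    reference are close to each other are candidates for being similar.
    """
    ref = schemas[ref_name]
    dists = {
        name: schema_distance(ref, s)
        for name, s in schemas.items()
        if name != ref_name
    }
    return sorted(dists.items(), key=lambda kv: kv[1])


def comparison_counts(n):
    """Number of comparisons for the reference and all-pairs strategies."""
    return {"reference": n - 1, "all_pairs": n * (n - 1) // 2}


# Hypothetical toy metadata; counts are assumed to be pre-scaled to [0, 1].
SCHEMAS = {
    "hr_prod": {"names": ["employee", "id", "name", "salary"], "counts": [0.2, 0.4]},
    "hr_copy": {"names": ["employee", "id", "name", "salary"], "counts": [0.2, 0.4]},
    "sales": {"names": ["order", "id", "total"], "counts": [0.9, 0.8]},
}

if __name__ == "__main__":
    for name, dist in rank_against_reference(SCHEMAS, "hr_prod"):
        print(f"{name}: {dist:.3f}")
    print(comparison_counts(354))
```

Comparing against one reference keeps the work linear in the number of schemas, whereas the all-pairs strategy grows quadratically, which is the trade-off the abstract highlights.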
Pages: 523-530
Number of pages: 8