Large Database Schema Matching using Data Mining Techniques

Cited by: 2

Authors:

Reis, Debora G. [1]

Ladeira, Marcelo [1]

Holanda, Maristela [1]

Victorino, Marcio C. [2]

Affiliations:

[1] Univ Brasilia UnB, Dept Comp Sci, Brasilia, DF, Brazil

[2] Univ Brasilia UnB, Fac Informat Sci, Brasilia, DF, Brazil

Source:

2018 18TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS (ICDMW) | 2018

Keywords:

schema matching; data mining; schema deduplication; cluster; data integration;

DOI:

10.1109/ICDMW.2018.00083

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

With the expanding diversity of database technologies and database sizes, it is becoming increasingly hard to identify similar relational databases among many large databases stored in different Database Management Systems (DBMS). Therefore, we propose to use data mining techniques to automatically identify similar structures of relational databases by comparing their metadata, which is composed of physical details of the databases. The amount of metadata is proportional to the size of the schema structure. The number of possible pairwise comparisons is quadratic in the number of schemas analyzed. Looking for the most efficient technique, we propose to calculate schema similarity by evaluating the distance from every schema to just one schema, which serves as a starting point. Schemas at smaller distances are more similar than schemas at larger distances. We compare this proposal against two other approaches. The first approach compares every schema against every other schema, excluding inverse comparisons. The second approach compares schemas within a group of schemas of similar sizes. To validate our proposal, an experiment is performed with 354 real schemas ranging in size from 2 to 20 thousand metadata items, together totaling more than 26 thousand tables and 238 thousand columns. Those schemas came from 5 different DBMS. The extracted metadata is transformed and formatted for comparing pairs of schemas. The textual features are compared using Cosine Distance and the numerical features are compared using Euclidean Distance. Then, hierarchical clustering is used to facilitate the visualization of the schemas that most closely resemble one another. Results showed that our approach was the most efficient, because it compared all schemas and identified the most similar schemas by their structure in less than 2 minutes.
The extracted metadata was used to create the first version of the metadata repository and an initial version of a data catalog, which contributed to the knowledge of existing data. Using this procedure, duplicated schemas were discovered and then discontinued, resulting in cost savings of 10% while freeing up infrastructure resources. This solution is flexible: it supports a variety of schema sizes and DBMS.
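
The single-reference comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The schema layout (a `names` token list and a `sizes` numeric vector), the sample data, and all function names are hypothetical, and because the abstract does not say how the textual and numerical distances are combined, the sketch simply returns both. The hierarchical clustering step is not covered.

```python
import math
from collections import Counter


def cosine_distance(a_tokens, b_tokens):
    """Cosine distance between two bags of tokens (e.g. table/column names)."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: an empty schema is maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


def distances_to_reference(schemas, reference_name):
    """Compare every schema to one reference schema: n - 1 comparisons."""
    ref = schemas[reference_name]
    return {
        name: (
            cosine_distance(ref["names"], schema["names"]),
            euclidean_distance(ref["sizes"], schema["sizes"]),
        )
        for name, schema in schemas.items()
        if name != reference_name
    }


def comparison_counts(n):
    """(all-pairs without inverse comparisons, single-reference) counts."""
    return n * (n - 1) // 2, n - 1


# Hypothetical metadata: textual features plus (table count, column count).
schemas = {
    "hr_prod": {"names": ["employee", "id", "name", "salary"], "sizes": [1, 3]},
    "hr_copy": {"names": ["employee", "id", "name", "salary"], "sizes": [1, 3]},
    "sales": {"names": ["order", "id", "total"], "sizes": [1, 2]},
}

distances = distances_to_reference(schemas, "hr_prod")
# Schemas closest to the reference come first.
ranked = sorted(distances, key=lambda name: distances[name])
print(ranked)
print(comparison_counts(354))
```

For the 354 schemas mentioned in the abstract, the all-pairs approach needs 354 × 353 / 2 = 62,481 comparisons, while the single-reference approach needs 353.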

Pages: 523-530

Number of pages: 8
