A model-based evaluation of data quality activities in KDD

Cited by: 25
Authors
Mezzanzanica, Mario [1, 2]
Boselli, Roberto [1, 2]
Cesarini, Mirko [1, 2]
Mercorio, Fabio [2]
Affiliations
[1] Univ Milano Bicocca, Dept Stat & Quantitat Methods, I-20126 Milan, Italy
[2] Univ Milano Bicocca, CRISP Res Ctr, I-20126 Milan, Italy
Keywords
Data quality; Data cleansing; Model checking; Real-life application; CHECKING; KNOWLEDGE;
DOI
10.1016/j.ipm.2014.07.007
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
We live in the Information Age, where most personal, business, and administrative data are collected and managed electronically. However, poor data quality may affect the effectiveness of knowledge discovery processes, thus making the development of the data improvement steps a significant concern. In this paper we propose the Multidimensional Robust Data Quality Analysis, a domain-independent technique aimed at improving data quality by evaluating the effectiveness of a black-box cleansing function. Here, the proposed approach has been realized through model checking techniques and then applied to a weakly structured dataset describing the working careers of millions of people. Our experimental outcomes show the effectiveness of our model-based approach for data quality as they provide a fine-grained analysis of both the source dataset and the cleansing procedures, enabling domain experts to identify the most relevant quality issues as well as the action points for improving the cleansing activities. Finally, an anonymized version of the dataset and the analysis results have been made publicly available to the community. © 2014 Elsevier Ltd. All rights reserved.
Pages: 144-166
Page count: 23
Related papers
50 in total
  • [31] Model-based clustering for longitudinal data
    De la Cruz-Mesia, Rolando
    Quintana, Fernando A.
    Marshall, Guillermo
    COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2008, 52 (03) : 1441 - 1457
  • [32] Model-Based Clustering of Temporal Data
    El Assaad, Hani
    Same, Allou
    Govaert, Gerard
    Aknin, Patrice
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2013, 2013, 8131 : 9 - 16
  • [33] Model-based integration and interpretation of data
    Petersen, J
    2004 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN & CYBERNETICS, VOLS 1-7, 2004, : 815 - 820
  • [34] Model-based acceptance test evaluation
    Pechtl, P
    Hartner, P
    Posch, M
    Petek, J
    MODELLING AND SIMULATION OF STEAM GENERATORS AND FIRING SYSTEMS, 2000, 1534 : 101 - 110
  • [35] Tools for the model-based performance evaluation
    Beilner, Heinz
    IT - Information Technology, 1995, 37 (03) : 5 - 9
  • [36] Optimization of Data Collection Strategies for Model-Based Evaluation and Decision-Making
    Cain, Robert
    van Moorsel, Aad
    2012 42ND ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS (DSN), 2012,
  • [37] The algorithmic anatomy of model-based evaluation
    Daw, Nathaniel D.
    Dayan, Peter
    PHILOSOPHICAL TRANSACTIONS OF THE ROYAL SOCIETY B-BIOLOGICAL SCIENCES, 2014, 369 (1655)
  • [38] Model-based Evaluation Environment for Sustainability
    Oertwig, Nicole
    Wintrich, Nikolaus
    Jochem, Roland
    12TH GLOBAL CONFERENCE ON SUSTAINABLE MANUFACTURING - EMERGING POTENTIALS, 2015, 26 : 641 - 645
  • [39] Model-based evaluation of grinding experiments
    Müller, F
    Polke, R
    Schäfer, M
    POWDER TECHNOLOGY, 1999, 105 (1-3) : 243 - 249
  • [40] An Initial Evaluation of Model-Based Testing
    Gudmundsson, Vignir
    Schulze, Christoph
    Ganesan, Dharmalingam
    Lindvall, Mikael
    Wiegand, Robert
    2013 IEEE INTERNATIONAL SYMPOSIUM ON SOFTWARE RELIABILITY ENGINEERING WORKSHOPS (ISSREW), 2013, : 13 - +