A model-based evaluation of data quality activities in KDD

Cited by: 25
Authors
Mezzanzanica, Mario [1, 2]
Boselli, Roberto [1, 2]
Cesarini, Mirko [1, 2]
Mercorio, Fabio [2]
Affiliations
[1] Univ Milano Bicocca, Dept Stat & Quantitat Methods, I-20126 Milan, Italy
[2] Univ Milano Bicocca, CRISP Res Ctr, I-20126 Milan, Italy
Keywords
Data quality; Data cleansing; Model checking; Real-life application; CHECKING; KNOWLEDGE
DOI
10.1016/j.ipm.2014.07.007
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
We live in the Information Age, where most personal, business, and administrative data are collected and managed electronically. However, poor data quality may affect the effectiveness of knowledge discovery processes, thus making the development of data improvement steps a significant concern. In this paper we propose the Multidimensional Robust Data Quality Analysis, a domain-independent technique aimed at improving data quality by evaluating the effectiveness of a black-box cleansing function. The proposed approach has been realized through model checking techniques and then applied to a weakly structured dataset describing the working careers of millions of people. Our experimental outcomes show the effectiveness of our model-based approach for data quality, as they provide a fine-grained analysis of both the source dataset and the cleansing procedures, enabling domain experts to identify the most relevant quality issues as well as the action points for improving the cleansing activities. Finally, an anonymized version of the dataset and the analysis results have been made publicly available to the community. © 2014 Elsevier Ltd. All rights reserved.
Pages: 144-166
Number of pages: 23