From Theory to Practice: A Data Quality Framework for Classification Tasks

Cited by: 17
Authors
Camilo Corrales, David [1, 2]
Ledezma, Agapito [2]
Carlos Corrales, Juan [1]
Affiliations
[1] Univ Cauca, Grp Ingn Telemat, Campus Tulcan, Popayan 190002, Colombia
[2] Univ Carlos III Madrid, Dept Informat, Ave Univ 30, Leganes 28911, Spain
Source
SYMMETRY-BASEL | 2018 / Vol. 10 / Issue 07
Keywords
DQF4CT; data quality issue; classification task; conceptual framework; data cleaning ontology; FEATURE-SELECTION; VERTEBRAL COLUMN; ONTOLOGIES; KNOWLEDGE; MODELS; PRINCIPLES; IMPUTATION; NOISE;
DOI
10.3390/sym10070248
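Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code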
07 ; 0710 ; 09 ;
Abstract
Data preprocessing is an essential step in knowledge discovery projects. Experts report that preprocessing tasks take between 50% and 70% of the total time of the knowledge discovery process. Several authors consider data cleaning one of the most cumbersome and critical of these tasks. Failure to ensure high data quality in the preprocessing stage significantly reduces the accuracy of any data analytics project. In this paper, we propose DQF4CT, a framework to address data quality issues in classification tasks. Our approach is composed of: (i) a conceptual framework that gives the user guidance on how to deal with data problems in classification tasks; and (ii) an ontology that represents data cleaning knowledge and suggests the proper data cleaning approaches. We present two case studies on real datasets: physical activity monitoring (PAM) and occupancy detection of an office room (OD). To evaluate our proposal, the datasets cleaned by DQF4CT were used to train the same algorithms that the authors of PAM and OD used in their classification tasks. Additionally, we evaluated DQF4CT on datasets from the Repository of Machine Learning Databases of the University of California, Irvine (UCI). Of the results achieved by models trained on the datasets cleaned by DQF4CT, 84% are better than those of the models reported by the datasets' authors.
Pages: 29
Related papers
50 in total
  • [1] Data Quality: From Theory to Practice
    Fan, Wenfei
    [J]. SIGMOD RECORD, 2015, 44 (03) : 7 - 18
  • [2] Corporate Data Quality Management From Theory to Practice
    Lucas, Ana
    [J]. SISTEMAS Y TECNOLOGIAS DE INFORMACION, 2010, : 542 - 548
  • [3] A General Framework for Data Uncertainty and Quality Classification
    Simard, Vanessa
    Ronnqvist, Mikael
    Lebel, Luc
    Lehoux, Nadia
    [J]. IFAC PAPERSONLINE, 2019, 52 (13) : 277 - 282
  • [4] A Data-centric AI Framework for Automating Exploratory Data Analysis and Data Quality Tasks
    Patel, Hima
    Guttula, Shanmukha
    Gupta, Nitin
    Hans, Sandeep
    Mittal, Ruhi Sharma
    Lokesh, N.
    [J]. ACM JOURNAL OF DATA AND INFORMATION QUALITY, 2023, 15 (04):
  • [5] Software Quality: from Theory to Practice
    Richardson, Ita
    Delaney, Yvonne
    [J]. QUATIC 2010: SEVENTH INTERNATIONAL CONFERENCE ON THE QUALITY OF INFORMATION AND COMMUNICATIONS TECHNOLOGY, 2010, : 150 - 155
  • [6] Quality of care: From theory to practice
    Guillain, H
    Raetzo, MA
    [J]. SCHWEIZERISCHE MEDIZINISCHE WOCHENSCHRIFT, 1997, 127 (13) : 541 - 548
  • [7] Packet Classification Algorithms: From Theory to Practice
    Qi, Yaxuan
    Xu, Lianghong
    Yang, Baohua
    Xue, Yibo
    Li, Jun
    [J]. IEEE INFOCOM 2009 - IEEE CONFERENCE ON COMPUTER COMMUNICATIONS, VOLS 1-5, 2009, : 648 - +
  • [8] Semistructured data: from practice to theory
    Abiteboul, S
    [J]. 16TH ANNUAL IEEE SYMPOSIUM ON LOGIC IN COMPUTER SCIENCE, PROCEEDINGS, 2001, : 379 - 386
  • [9] Classification of primitive manufacturing tasks from filtered event data
    Duarte, Laura
    Neto, Pedro
    [J]. JOURNAL OF MANUFACTURING SYSTEMS, 2023, 68 : 12 - 24
  • [10] Data Quality Estimation Framework for Faster Tax Code Classification
    Kondadadi, Ravi
    Williams, Allen
    Nicolov, Nicolas
    [J]. PROCEEDINGS OF THE 5TH WORKSHOP ON E-COMMERCE AND NLP (ECNLP 5), 2022, : 29 - 34