Using Unsupervised Machine Learning for Data Quality. Application to Financial Governmental Data Integration

Authors
Necba, Hanae [1]
Rhanoui, Maryem [1, 2]
El Asri, Bouchra [1]
Affiliations
[1] Mohammed V Univ, ENSIAS, Rabat IT Ctr, IMS Team,ADMIR Lab, Rabat, Morocco
[2] Sch Informat Sci, LYRICA Lab, Meridian Team, Rabat, Morocco
Keywords
Machine Learning; Data quality; Name matching; Affinity propagation; Levenshtein distance; Clustering; Unsupervised learning; Scikit learn; Data integration problems; Big data; Management
DOI
10.1007/978-3-319-96292-4_16
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Data quality means that data are correct, reliable, accurate, and valid for use, and that they serve their purpose in a given context. Data quality is crucial for making the right decisions and producing the right reports in every organization. However, the huge volume of data produced by organizations, together with the integration of redundant and heterogeneous data, makes manual methods of data quality control difficult. Intelligent technologies such as machine learning are therefore essential to ensure data quality across the organization. In this paper, we present an unsupervised learning approach that matches similar names and groups them into the same cluster in order to correct the data and thereby ensure data quality. Our approach is validated in the context of the quality of taxpayers' financial data, using scikit-learn, the machine learning library for the Python programming language.
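The abstract and keywords name Levenshtein distance, affinity propagation, and scikit-learn, but this record does not give the paper's actual pipeline. The following is only a minimal sketch of how such name clustering might look, assuming negated edit distance as the similarity measure. The helper names (levenshtein, similarity_matrix, cluster_names) and the parameter values are illustrative assumptions, not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def similarity_matrix(names):
    """Negated pairwise distances, so larger values mean more similar."""
    return [[-levenshtein(x, y) for y in names] for x in names]


def cluster_names(names, damping=0.5, random_state=0):
    """Group similar names with affinity propagation (hypothetical helper).

    damping and random_state are assumed values, not the paper's settings.
    """
    # Imported lazily so the distance helpers above work without
    # NumPy or scikit-learn installed.
    import numpy as np
    from sklearn.cluster import AffinityPropagation

    similarities = np.array(similarity_matrix(names), dtype=float)
    model = AffinityPropagation(affinity="precomputed",
                                damping=damping,
                                random_state=random_state)
    model.fit(similarities)

    # Map each cluster label to its member names. scikit-learn assigns
    # label -1 to every sample when the algorithm fails to converge.
    clusters = {}
    for name, label in zip(names, model.labels_):
        clusters.setdefault(int(label), []).append(name)
    return clusters
```

Calling cluster_names on a list of name variants returns a dictionary from cluster label to member names. Affinity propagation does not need the number of clusters in advance, which may be why it suits name matching where the number of distinct entities is unknown. The grouping it produces depends on the damping and preference settings, so any result should be inspected rather than trusted blindly.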
Pages: 197-209
Number of pages: 13