Using Unsupervised Machine Learning for Data Quality. Application to Financial Governmental Data Integration

Cited by: 0
Authors
Necba, Hanae [1]
Rhanoui, Maryem [1, 2]
El Asri, Bouchra [1]
Affiliations
[1] Mohammed V Univ, ENSIAS, Rabat IT Ctr, IMS Team,ADMIR Lab, Rabat, Morocco
[2] Sch Informat Sci, LYRICA Lab, Meridian Team, Rabat, Morocco
Keywords
Machine Learning; Data quality; Name matching; Affinity propagation; Levenshtein distance; Clustering; Unsupervised learning; Scikit learn; Data integration problems; BIG DATA; MANAGEMENT
DOI
10.1007/978-3-319-96292-4_16
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Data quality means that data are correct, reliable, accurate, and valid for use, so that they serve their purpose in a given context. Data quality is crucial for making sound decisions and producing reliable reports in every organization. However, the huge volume of data that organizations produce, together with the integration of redundant and heterogeneous data, makes manual data quality control difficult. Intelligent technologies such as machine learning are therefore essential to ensuring data quality across the organization. In this paper, we present an unsupervised learning approach that matches similar names and groups them into the same cluster in order to correct the data and thereby ensure data quality. Our approach is validated in the context of the quality of taxpayers' financial data, using scikit-learn, the machine learning library for the Python programming language.
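The abstract and keywords name three ingredients: Levenshtein distance, affinity propagation, and scikit-learn. The sketch below shows one plausible way to combine them for name matching. It is not the authors' implementation: the function names (levenshtein, similarity_matrix, cluster_names), the damping value, and the sample names are assumptions made for illustration. It assumes that negated edit distance is passed to scikit-learn's AffinityPropagation as a precomputed similarity.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_matrix(names):
    """Pairwise similarity as negated edit distance (0 means identical)."""
    return [[-levenshtein(x, y) for y in names] for x in names]


def cluster_names(names, damping=0.5, random_state=0):
    """Group names with affinity propagation on the precomputed similarities.

    Hypothetical helper. NumPy and scikit-learn are imported lazily so the
    string helpers above remain usable without them installed.
    """
    import numpy as np
    from sklearn.cluster import AffinityPropagation

    sim = np.array(similarity_matrix(names), dtype=float)
    model = AffinityPropagation(affinity="precomputed",
                                damping=damping,
                                random_state=random_state)
    labels = model.fit(sim).labels_
    clusters = defaultdict(list)
    for name, label in zip(names, labels):
        clusters[int(label)].append(name)
    return dict(clusters)


if __name__ == "__main__":
    # Made-up taxpayer name variants, for illustration only.
    sample = ["ACME SARL", "ACME S.A.R.L", "ACME SARL.",
              "ATLAS TRANSPORT", "ATLAS TRANSPORTS", "ATLS TRANSPORT"]
    for label, members in cluster_names(sample).items():
        print(label, members)
```

Affinity propagation does not require the number of clusters in advance, which suits name matching where the number of distinct entities is unknown. The grouping it returns depends on the damping and preference settings, so the clusters should be reviewed before any record is corrected.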
Pages: 197-209
Page count: 13