Using Unsupervised Machine Learning for Data Quality. Application to Financial Governmental Data Integration

Times cited: 0
Authors
Necba, Hanae [1]
Rhanoui, Maryem [1, 2]
El Asri, Bouchra [1]
Affiliations
[1] Mohammed V Univ, ENSIAS, Rabat IT Ctr, IMS Team, ADMIR Lab, Rabat, Morocco
[2] Sch Informat Sci, LYRICA Lab, Meridian Team, Rabat, Morocco
Keywords
Machine Learning; Data quality; Name matching; Affinity propagation; Levenshtein distance; Clustering; Unsupervised learning; Scikit learn; Data integration problems; BIG DATA; MANAGEMENT
DOI
10.1007/978-3-319-96292-4_16
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Data quality means that data are correct, reliable, accurate, and valid for use, and that they serve their purpose in a given context. Data quality is crucial for making sound decisions and producing reliable reports in every organization. However, the huge volume of data that organizations produce, together with the integration of redundant and heterogeneous data, makes manual data quality control difficult. Intelligent technologies such as machine learning are therefore essential to ensure data quality across the organization. In this paper, we present an unsupervised learning approach that matches similar names and groups them in the same cluster in order to correct the data and thereby ensure data quality. Our approach is validated in the context of the financial data quality of taxpayers, using scikit-learn, the machine learning library for the Python programming language.
Pages: 197-209
Page count: 13
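The abstract and keywords name Levenshtein distance, affinity propagation, and scikit-learn, but this record does not include the authors' code. The following is only a minimal sketch of how those pieces could fit together, not the paper's implementation. The sample names are invented for illustration, and the helper names `levenshtein`, `similarity_matrix`, and `cluster_names` are hypothetical. It assumes that negated edit distance is used as the similarity and that scikit-learn's `AffinityPropagation` is run with a precomputed affinity matrix.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_matrix(names):
    """Negated pairwise edit distances: a larger value means more similar."""
    n = len(names)
    sim = np.zeros((n, n), dtype=float)
    for i in range(n):
        for j in range(n):
            sim[i, j] = -levenshtein(names[i].lower(), names[j].lower())
    return sim


def cluster_names(names, damping=0.5, random_state=0):
    """Group names by the cluster label that affinity propagation assigns."""
    model = AffinityPropagation(affinity="precomputed",
                                damping=damping,
                                random_state=random_state)
    labels = model.fit(similarity_matrix(names)).labels_
    groups = {}
    for name, label in zip(names, labels):
        groups.setdefault(int(label), []).append(name)
    return list(groups.values())


if __name__ == "__main__":
    # Invented sample data (assumption): spelling variants of two names.
    sample = ["Societe Atlas", "Societe Atlass", "Soc Atlas",
              "Banque Rif", "Banque Riff", "Banke Rif"]
    for group in cluster_names(sample):
        print(group)
```

Affinity propagation does not need the number of clusters in advance, which suits name matching where the number of distinct entities is unknown. The grouping it returns depends on the damping and preference settings, and on small inputs it may fail to converge, so the resulting clusters should be reviewed rather than trusted blindly.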