Using Unsupervised Machine Learning for Data Quality. Application to Financial Governmental Data Integration

Cited by: 0
Authors
Necba, Hanae [1]
Rhanoui, Maryem [1, 2]
El Asri, Bouchra [1]
Affiliations
[1] Mohammed V Univ, ENSIAS, Rabat IT Ctr, IMS Team, ADMIR Lab, Rabat, Morocco
[2] Sch Informat Sci, LYRICA Lab, Meridian Team, Rabat, Morocco
Keywords
Machine Learning; Data quality; Name matching; Affinity propagation; Levenshtein distance; Clustering; Unsupervised learning; Scikit learn; Data integration problems; BIG DATA; MANAGEMENT
DOI
10.1007/978-3-319-96292-4_16
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Data quality means that data are correct, reliable, accurate, and valid for use, so that they serve their purpose in a given context. Data quality is crucial for making sound decisions and producing reliable reports in every organization. However, the huge volume of data produced by organizations, and the integration of redundant and heterogeneous data, make manual methods of data quality control difficult. Intelligent technologies such as Machine Learning are therefore essential to ensure data quality across the organization. In this paper, we present an unsupervised learning approach that matches similar names and groups them into the same cluster in order to correct the data and thereby ensure data quality. Our approach is validated in the context of the quality of taxpayers' financial data, using scikit-learn, the machine learning library for the Python programming language.
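The abstract and keywords name Levenshtein distance and affinity propagation as the ingredients of the name-matching approach. The following is a minimal sketch of how those two pieces could fit together in scikit-learn; it is not the authors' implementation. The helper names `levenshtein` and `cluster_names`, the sample names, and the parameter values are all assumptions made for illustration. scikit-learn does not ship an edit-distance function, so a plain dynamic-programming version is written out here.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cluster_names(names, damping=0.5, random_state=0):
    """Group similar names (hypothetical helper, not from the paper).

    Affinity propagation expects similarities, so the negated edit
    distance is used: identical names get 0, distant names get large
    negative values. Returns a list of name groups.
    """
    similarity = -np.array(
        [[levenshtein(a, b) for b in names] for a in names], dtype=float
    )
    model = AffinityPropagation(
        affinity="precomputed", damping=damping, random_state=random_state
    )
    labels = model.fit_predict(similarity)

    groups = {}
    for name, label in zip(names, labels):
        groups.setdefault(int(label), []).append(name)
    return list(groups.values())


if __name__ == "__main__":
    # Invented sample data: spelling variants of two taxpayer names.
    sample = ["Mohamed Alami", "Mohammed Alami", "Mohamed Alamy",
              "Fatima Bennani", "Fatima Benani", "Fatema Bennani"]
    for group in cluster_names(sample):
        print(group)
```

The number of clusters is not fixed in advance; it depends on the `preference` value, which scikit-learn sets to the median similarity by default. If the algorithm does not converge, scikit-learn labels every sample `-1`, and this sketch then returns a single group.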
Pages: 197-209
Page count: 13
Related papers
50 in total
  • [31] Interactive Machine Learning for Laboratory Data Integration
    Fillmore, Nathanael
    Do, Nhan
    Brophy, Mary
    Zimolzak, Andrew
    MEDINFO 2019: HEALTH AND WELLBEING E-NETWORKS FOR ALL, 2019, 264 : 133 - 137
  • [32] Amalur: The Convergence of Data Integration and Machine Learning
    Li, Ziyu
    Sun, Wenbo
    Zhan, Danning
    Kang, Yan
    Chen, Lydia
    Bozzon, Alessandro
    Hai, Rihan
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2024, 36 (12) : 7353 - 7367
  • [33] Sentiment Analysis of Financial Textual data Using Machine Learning and Deep Learning Models
    Ahmad H.O.
    Umar S.U.
    Informatica (Slovenia), 2023, 47 (05) : 153 - 158
  • [34] Quality Assessment of Data Using Statistical and Machine Learning Methods
    Singh, Prerna
    Suri, Bharti
    COMPUTATIONAL INTELLIGENCE IN DATA MINING, VOL 2, 2015, 32 : 89 - 97
  • [35] Data Quality for Machine Learning Tasks
    Gupta, Nitin
    Mujumdar, Shashank
    Patel, Hima
    Masuda, Satoshi
    Panwar, Naveen
    Bandyopadhyay, Sambaran
    Mehta, Sameep
    Guttula, Shanmukha
    Afzal, Shazia
    Mittal, Ruhi Sharma
    Munigala, Vitobha
    KDD '21: PROCEEDINGS OF THE 27TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2021, : 4040 - 4041
  • [36] Predicting Quality Medical Drug Data Towards Meaningful Data using Machine Learning
    Al-Showarah, Suleyman
    Al-Taie, Abubaker
    Salman, Hamzeh Eyal
    Alzyadat, Wael
    Alkhalaileh, Mohannad
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2023, 14 (08) : 1052 - 1059
  • [37] Improving the Quality of Art Market Data Using Linked Open Data and Machine Learning
    Filipiak, Dominik
    Filipowska, Agata
    BUSINESS INFORMATION SYSTEMS WORKSHOPS, BIS 2016, 2017, 263 : 418 - 428
  • [38] Application of machine learning in ocean data
    Lou, Ranran
    Lv, Zhihan
    Dang, Shuping
    Su, Tianyun
    Li, Xinfang
    MULTIMEDIA SYSTEMS, 2023, 29 (03) : 1815 - 1824
  • [39] Application of Machine Learning for Cytometry Data
    Hu, Zicheng
    Bhattacharya, Sanchita
    Butte, Atul J.
    FRONTIERS IN IMMUNOLOGY, 2022, 12
  • [40] Application of machine learning in ocean data
    Ranran Lou
    Zhihan Lv
    Shuping Dang
    Tianyun Su
    Xinfang Li
    Multimedia Systems, 2023, 29 : 1815 - 1824