An End-to-End Big Data Deduplication Framework based on Online Continuous Learning

Authors
Elouataoui, Widad [1]
El Mendili, Saida [1]
El Alaoui, Imane [2]
Gahi, Youssef [1]
Affiliations
[1] Ibn Tofail Univ, Natl Sch Appl Sci, Lab Engn Sci, Kenitra, Morocco
[2] Ibn Tofail Univ, Telecommun Syst & Decis Engn Lab, Kenitra, Morocco
Keywords
Big data deduplication; online continual learning; big data; entity resolution; record linkage; duplicates detection
DOI
Not available
Chinese Library Classification (CLC) number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
While the benefits of big data are numerous, most collected data is of poor quality and therefore cannot be used effectively as is. One of the leading big data quality challenges is data duplication. Indeed, gathered big data are usually messy and may contain duplicated records. The process of detecting and eliminating duplicated records is known as deduplication, entity resolution, or record linkage. Data deduplication has been widely discussed in the literature, and multiple deduplication approaches have been suggested. However, few efforts have been made to address deduplication in the big data context. Moreover, existing big data deduplication approaches do not handle the degradation of the deduplication model's performance during serving. In addition, most current methods are limited to duplicate detection, which is only one part of the deduplication process. Therefore, this paper proposes an End-to-End Big Data Deduplication Framework based on a semi-supervised learning approach that outperforms existing big data deduplication approaches, with an F-score of 98.21%, a precision of 98.24%, and a recall of 96.48%. The suggested framework encompasses all data deduplication phases: data pre-processing and preparation, automated data labeling, duplicate detection, data cleaning, and auditing and monitoring. This last phase is based on an online continual learning strategy for big data deduplication that addresses the degradation of model performance during serving. The results show that the suggested continual learning strategy increased model accuracy by 1.16%. Furthermore, we apply the proposed framework to three different datasets and compare its performance against existing deduplication models. Finally, the results are discussed, conclusions are drawn, and future work directions are highlighted.
Pages: 281-291
Page count: 11