An End-to-End Big Data Deduplication Framework based on Online Continuous Learning

Authors
Elouataoui, Widad [1]
El Mendili, Saida [1]
El Alaoui, Imane [2]
Gahi, Youssef [1]
Affiliations
[1] Ibn Tofail Univ, Natl Sch Appl Sci, Lab Engn Sci, Kenitra, Morocco
[2] Ibn Tofail Univ, Telecommun Syst & Decis Engn Lab, Kenitra, Morocco
Keywords
Big data deduplication; online continual learning; big data; entity resolution; record linkage; duplicates detection
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods]
Subject classification code
081202
Abstract
While the benefits of big data are numerous, most collected data is of poor quality and therefore cannot be used effectively as is. One of the leading big data quality challenges is data duplication. Indeed, gathered big data is usually messy and may contain duplicated records. The process of detecting and eliminating duplicated records is known as deduplication, entity resolution, or record linkage. Data deduplication has been widely discussed in the literature, and multiple deduplication approaches have been suggested. However, few efforts have been made to address deduplication in the big data context. Also, existing big data deduplication approaches do not handle the declining performance of the deduplication model during serving. In addition, most current methods are limited to duplicate detection, which is only one part of the deduplication process. Therefore, this paper proposes an End-to-End Big Data Deduplication Framework based on a semi-supervised learning approach that outperforms existing big data deduplication approaches, with an F-score of 98.21%, a precision of 98.24%, and a recall of 96.48%. Moreover, the suggested framework encompasses all data deduplication phases, including data pre-processing and preparation, automated data labeling, duplicate detection, data cleaning, and an auditing and monitoring phase. This last phase is based on an online continual learning strategy for big data deduplication that addresses the declining performance of the deduplication model during serving. The results show that the suggested continual learning strategy increased model accuracy by 1.16%. Furthermore, we apply the proposed framework to three different datasets and compare its performance against existing deduplication models. Finally, the results are discussed, conclusions are drawn, and future work directions are highlighted.
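To make the duplicate detection and data cleaning phases named in the abstract concrete, the following is a minimal sketch only. It is not the authors' semi-supervised method: it assumes a plain pairwise string-similarity comparison from the Python standard library, and the function names (`normalize`, `find_duplicates`, `deduplicate`), the sample records, and the 0.9 threshold are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(record: str) -> str:
    """Lowercase and collapse whitespace (a stand-in for pre-processing)."""
    return " ".join(record.lower().split())


def find_duplicates(records, threshold=0.9):
    """Return index pairs whose normalized similarity meets the threshold.

    The threshold is an assumed value, not one reported in the paper.
    """
    cleaned = [normalize(r) for r in records]
    pairs = []
    for i, j in combinations(range(len(cleaned)), 2):
        score = SequenceMatcher(None, cleaned[i], cleaned[j]).ratio()
        if score >= threshold:
            pairs.append((i, j))
    return pairs


def deduplicate(records, threshold=0.9):
    """Drop the later record of every detected duplicate pair."""
    dropped = {j for _, j in find_duplicates(records, threshold)}
    return [r for k, r in enumerate(records) if k not in dropped]


if __name__ == "__main__":
    # Hypothetical sample records, not taken from the paper's datasets.
    sample = ["John Smith, Kenitra", "john  smith, kenitra", "Jane Doe, Rabat"]
    print(find_duplicates(sample))
    print(deduplicate(sample))
```

Comparing every pair is quadratic in the number of records, so a sketch like this would not scale to big data as written; it only illustrates the detect-then-clean idea.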
Pages: 281-291
Page count: 11