An End-to-End Big Data Deduplication Framework based on Online Continuous Learning

Cited by: 0
Authors
Elouataoui, Widad [1]
El Mendili, Saida [1]
El Alaoui, Imane [2]
Gahi, Youssef [1]
Affiliations
[1] Ibn Tofail Univ, Natl Sch Appl Sci, Lab Engn Sci, Kenitra, Morocco
[2] Ibn Tofail Univ, Telecommun Syst & Decis Engn Lab, Kenitra, Morocco
Keywords
Big data deduplication; online continual learning; big data; entity resolution; record linkage; duplicate detection
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
While the benefits of big data are numerous, most collected data is of poor quality and therefore cannot be used effectively as is. One of the leading big data quality challenges is data duplication: gathered big data is usually messy and may contain duplicated records. The process of detecting and eliminating duplicated records is known as deduplication, entity resolution, or record linkage. Data deduplication has been widely discussed in the literature, and multiple deduplication approaches have been suggested. However, few efforts have addressed deduplication in the big data context. Moreover, existing big data deduplication approaches do not handle the degradation of the deduplication model's performance during serving. In addition, most current methods are limited to duplicate detection, which is only one part of the deduplication process. This paper therefore proposes an End-to-End Big Data Deduplication Framework based on a semi-supervised learning approach that outperforms existing big data deduplication approaches, with an F-score of 98.21%, a Precision of 98.24%, and a Recall of 96.48%. The suggested framework encompasses all data deduplication phases: data pre-processing and preparation, automated data labeling, duplicate detection, data cleaning, and auditing and monitoring. This last phase is based on an online continual learning strategy for big data deduplication that addresses the degradation of the deduplication model's performance during serving. The obtained results show that the suggested continual learning strategy increased model accuracy by 1.16%. Furthermore, we apply the proposed framework to three different datasets and compare its performance against existing deduplication models. Finally, the results are discussed, conclusions are drawn, and directions for future work are highlighted.
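The abstract does not specify the model or the update rule used in the monitoring phase, so the following is only a minimal sketch of the general idea of online continual learning for duplicate detection, not the authors' method. It assumes a logistic-regression classifier over per-field string similarities that is updated one labeled record pair at a time. The names `pair_features` and `OnlineDuplicateClassifier`, the learning rate, and the sample records are all hypothetical.

```python
import math
from difflib import SequenceMatcher


def pair_features(a, b):
    """Per-field string similarity in [0, 1] for two records with the same keys."""
    return [
        SequenceMatcher(None, str(a[k]).lower(), str(b[k]).lower()).ratio()
        for k in sorted(a)
    ]


class OnlineDuplicateClassifier:
    """Hypothetical logistic-regression matcher updated one labeled pair at a time."""

    def __init__(self, n_features, lr=0.5):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr  # assumed learning rate, not taken from the paper

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """One stochastic-gradient step on the log loss for a pair labeled y (1 = duplicate)."""
        err = self.predict_proba(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


# Illustrative records (made up for this sketch).
rec_a = {"name": "Jane Doe", "city": "Kenitra"}
rec_b = {"name": "jane doe", "city": "Kenitra"}
rec_c = {"name": "Omar Alami", "city": "Rabat"}

model = OnlineDuplicateClassifier(n_features=2)
# During serving, labels obtained from auditing would arrive over time;
# here the same two labeled pairs are simply replayed.
for _ in range(50):
    model.update(pair_features(rec_a, rec_b), 1)
    model.update(pair_features(rec_a, rec_c), 0)

p_dup = model.predict_proba(pair_features(rec_a, rec_b))
p_non = model.predict_proba(pair_features(rec_a, rec_c))
```

Because each update touches only one pair, a model of this kind can keep adapting while it is being served instead of being retrained from scratch. How the framework actually selects pairs and decides when to update is not described in the abstract.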
Pages: 281-291
Number of pages: 11