An Effective and Cost-Based Framework for a Qualitative Hybrid Data Deduplication

Cited by: 3
Authors
Haruna, Charles R. [1 ,2 ]
Hou, MengShu [1 ]
Eghan, Moses J. [2 ]
Kpiebaareh, Michael Y. [1 ]
Tandoh, Lawrence [1 ]
Affiliations
[1] Univ Elect Sci & Technol China, Chengdu, Sichuan, Peoples R China
[2] Univ Cape Coast, Cape Coast, Ghana
Source
ADVANCES IN COMPUTER COMMUNICATION AND COMPUTATIONAL SCIENCES, IC4S 2018 | 2019 / Vol. 924
Keywords
Qualitative hybrid data deduplication; Edge-Pivot clustering; Entity resolution; Crowdsourcing;
DOI
10.1007/978-981-13-6861-5_44
Chinese Library Classification
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
In the real world, an entity may occur several times in a database. These duplicates may have varying keys and/or contain errors, which makes deduplication a difficult task. Deduplication cannot be solved accurately using machine-based or crowdsourcing techniques alone. Crowdsourcing was used to address the shortcomings of machine-based approaches. Compared to machines, the crowd provided relatively accurate results, but it was slow and very expensive. A hybrid technique for data deduplication using Euclidean distance and a chromatic correlation clustering algorithm was presented. The technique aimed to reduce the crowdsourcing cost, reduce the time the crowd spends on deduplication, and provide higher deduplication accuracy. In the experiments, the proposed algorithm was compared with several existing techniques and outperformed some of them, offering high deduplication accuracy and efficiency while incurring low crowdsourcing cost.
Pages: 511-520
Page count: 10