An Effective and Cost-Based Framework for a Qualitative Hybrid Data Deduplication

Cited by: 3
Authors
Haruna, Charles R. [1, 2]
Hou, MengShu [1]
Eghan, Moses J. [2]
Kpiebaareh, Michael Y. [1]
Tandoh, Lawrence [1]
Affiliations
[1] Univ Elect Sci & Technol China, Chengdu, Sichuan, Peoples R China
[2] Univ Cape Coast, Cape Coast, Ghana
Source
ADVANCES IN COMPUTER COMMUNICATION AND COMPUTATIONAL SCIENCES, IC4S 2018 | 2019 / Vol. 924
Keywords
Qualitative hybrid data deduplication; Edge-Pivot clustering; Entity resolution; Crowdsourcing;
DOI
10.1007/978-981-13-6861-5_44
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
In the real world, an entity may occur several times in a database. These duplicates may have varying keys and/or contain errors, which makes deduplication a difficult task. Deduplication cannot be solved accurately using machine-based or crowdsourcing techniques alone. Crowdsourcing was used to address the shortcomings of machine-based approaches. Compared to machines, the crowd provided relatively accurate results, but it was slow and very expensive. A hybrid technique for data deduplication using Euclidean distance and a chromatic correlation clustering algorithm was presented. The technique aimed to reduce the crowdsourcing cost, reduce the time the crowd spends on deduplication, and provide higher deduplication accuracy. In the experiments, the proposed algorithm was compared with some existing techniques and outperformed some of them, offering high deduplication accuracy and efficiency while incurring a low crowdsourcing cost.
Pages: 511-520
Page count: 10