FDup: a framework for general-purpose and efficient entity deduplication of record collections

Cited by: 2
Authors
De Bonis, Michele [1]
Manghi, Paolo [1]
Atzori, Claudio [1]
Affiliations
[1] CNR, Ist Sci & Tecnol Informaz A Faedo ISTI, Pisa, Italy
Keywords
Deduplication; Scholarly communication
DOI
10.7717/peerj-cs.1058
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Deduplication is a technique aiming at identifying and resolving duplicate metadata records in a collection. This article describes FDup (Flat Collections Deduper), a general-purpose software framework supporting a complete deduplication workflow to manage big data record collections: metadata record data model definition, identification of candidate duplicates, and identification of duplicates. FDup brings two main innovations. First, it delivers a full deduplication framework in a single easy-to-use software package based on the Apache Spark Hadoop framework, where developers can customize the optimal and parallel workflow steps of blocking, sliding windows, and similarity matching function via an intuitive configuration file. Second, it introduces a novel approach to improve performance, beyond the known techniques of "blocking" and "sliding window", by introducing a smart similarity matching function, T-match. T-match is engineered as a decision tree that drives the comparisons of the fields of two records as branches of predicates and allows for successful or unsuccessful early-exit strategies. The efficacy of the approach is proved by experiments performed over big data collections of metadata records in the OpenAIRE Research Graph, a known open access knowledge base in scholarly communication.
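The three ideas named in the abstract (blocking, sliding window, and a decision-tree matcher with early exits) can be illustrated with a small plain-Python sketch. This is not FDup's code or configuration format: the record fields (`title`, `doi`, `year`), the blocking key, the window size, the 0.9 similarity threshold, and every function name below are assumptions chosen only for illustration, and the sketch runs on a single machine rather than on Spark.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple, Union

Record = Dict[str, str]


def blocking_key(record: Record) -> str:
    """Hypothetical blocking key: first 3 alphanumeric chars of the title."""
    return "".join(ch for ch in record["title"].lower() if ch.isalnum())[:3]


def candidate_pairs(records: List[Record], window: int = 3) -> List[Tuple[Record, Record]]:
    """Group records by blocking key, sort each block by title, and pair
    each record only with the next (window - 1) records in its block."""
    blocks: Dict[str, List[Record]] = {}
    for rec in records:
        blocks.setdefault(blocking_key(rec), []).append(rec)
    pairs = []
    for block in blocks.values():
        block.sort(key=lambda r: r["title"].lower())
        for i, left in enumerate(block):
            for right in block[i + 1 : i + window]:
                pairs.append((left, right))
    return pairs


@dataclass
class Node:
    """One decision-tree node: a predicate plus what to do on each outcome.
    A bool outcome is a leaf, i.e. an early exit (True = match)."""
    predicate: Callable[[Record, Record], bool]
    on_true: Union["Node", bool]
    on_false: Union["Node", bool]


def tree_match(node: Union[Node, bool], a: Record, b: Record) -> bool:
    """Walk the tree until a leaf is reached; later predicates are skipped."""
    while not isinstance(node, bool):
        node = node.on_true if node.predicate(a, b) else node.on_false
    return node


def same_doi(a: Record, b: Record) -> bool:
    return bool(a["doi"]) and a["doi"] == b["doi"]


def same_year(a: Record, b: Record) -> bool:
    return a["year"] == b["year"]


def similar_title(a: Record, b: Record) -> bool:
    ratio = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    return ratio >= 0.9  # assumed threshold


# Illustrative tree: an equal DOI exits early with a match; a different year
# exits early with a non-match; only otherwise is the costlier title
# comparison evaluated.
TREE = Node(same_doi, True, Node(same_year, Node(similar_title, True, False), False))

records = [
    {"id": "1", "title": "FDup: a framework for deduplication", "doi": "10.1/x", "year": "2022"},
    {"id": "2", "title": "FDup: A Framework for Deduplication", "doi": "10.1/x", "year": "2022"},
    {"id": "3", "title": "FDup: a framework for deduplication", "doi": "", "year": "2022"},
    {"id": "4", "title": "FDup: a framework for deduplication", "doi": "", "year": "2019"},
    {"id": "5", "title": "Graph mining at scale", "doi": "", "year": "2022"},
]

duplicates = [
    (a["id"], b["id"]) for a, b in candidate_pairs(records) if tree_match(TREE, a, b)
]
print(duplicates)
```

In this toy setup the cheap predicates sit near the root, so the string-similarity comparison runs only for pairs that neither share a DOI nor differ in year, which is the kind of saving an early-exit tree is meant to provide.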
Pages: 23