FDup: a framework for general-purpose and efficient entity deduplication of record collections

Cited by: 0
Authors
De Bonis M. [1]
Manghi P. [1]
Atzori C. [1]
Affiliations
[1] Istituto di Scienza e Tecnologie dell'Informazione “A. Faedo” (ISTI), Consiglio Nazionale delle Ricerche (CNR), Pisa
Funding
European Union Horizon 2020
Keywords
Deduplication; Scholarly communication;
DOI
10.7717/PEERJ-CS.1058
Abstract
Deduplication is a technique for identifying and resolving duplicate metadata records in a collection. This article describes FDup (Flat Collections Deduper), a general-purpose software framework that supports a complete deduplication workflow for big data record collections: definition of the metadata record data model, identification of candidate duplicates, and identification of duplicates. FDup brings two main innovations. First, it delivers a full deduplication framework in a single easy-to-use software package based on the Apache Spark Hadoop framework, where developers can customize the optimal and parallel workflow steps of blocking, sliding windows, and similarity matching function via an intuitive configuration file. Second, it introduces a novel approach to improve performance, beyond the known techniques of "blocking" and "sliding window", by introducing a smart similarity matching function, T-match. T-match is engineered as a decision tree that drives the comparisons of the fields of two records as branches of predicates and allows for successful or unsuccessful early-exit strategies. The efficacy of the approach is demonstrated by experiments performed over big data collections of metadata records in the OpenAIRE Research Graph, a known open access knowledge base in scholarly communication. © Copyright 2022 De Bonis et al.
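The workflow named in the abstract (blocking, sliding window, then a decision-tree matcher with early exits) can be illustrated with a minimal sketch. This is not FDup's code or configuration format: the record fields (`id`, `title`, `year`, `doi`), the blocking key, the window size, the 0.9 title threshold, and all function names are assumptions chosen for illustration, and the sketch runs in plain Python rather than on Apache Spark.

```python
from collections import defaultdict
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Union

MATCH, NO_MATCH = "MATCH", "NO_MATCH"

@dataclass
class Node:
    """One decision-tree node: a predicate over two records and two outcomes."""
    predicate: Callable[[dict, dict], bool]
    on_true: Union["Node", str]
    on_false: Union["Node", str]

def t_match(node, a, b):
    """Walk the tree; reaching a MATCH/NO_MATCH leaf ends the comparison early."""
    while not isinstance(node, str):
        node = node.on_true if node.predicate(a, b) else node.on_false
    return node == MATCH

# Hypothetical field predicates (assumed for this sketch).
def same_doi(a, b):
    return bool(a["doi"]) and a["doi"] == b["doi"]

def same_year(a, b):
    return a["year"] == b["year"]

def similar_title(a, b):
    ratio = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    return ratio >= 0.9

# Equal DOIs: successful early exit. Different years: unsuccessful early exit.
# The costlier title comparison runs only when neither exit applies.
tree = Node(same_doi, MATCH,
            Node(same_year, Node(similar_title, MATCH, NO_MATCH), NO_MATCH))

def candidate_pairs(records, window=3):
    """Blocking on a title prefix, then a sliding window over each sorted block."""
    blocks = defaultdict(list)
    for r in records:
        blocks[r["title"].lower()[:4]].append(r)
    for block in blocks.values():
        block.sort(key=lambda r: r["title"].lower())
        for i, a in enumerate(block):
            # Compare each record only with the next (window - 1) records.
            for b in block[i + 1:i + window]:
                yield a, b

records = [
    {"id": 1, "title": "FDup: a framework for entity deduplication",
     "year": 2022, "doi": "10.7717/peerj-cs.1058"},
    {"id": 2, "title": "FDup: A Framework for Entity Deduplication",
     "year": 2022, "doi": ""},
    {"id": 3, "title": "FDup: a framework for entity deduplication",
     "year": 2019, "doi": ""},
    {"id": 4, "title": "Binary tree predictive coding", "year": 1997, "doi": ""},
]

duplicates = [(a["id"], b["id"])
              for a, b in candidate_pairs(records) if t_match(tree, a, b)]
print(duplicates)
```

In this sketch, record 4 never meets the other three because blocking places it in a different block, and the pairs involving record 3 are rejected at the year check without any title comparison, which is the kind of early exit the abstract attributes to T-match.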