Entity deduplication in big data graphs for scholarly communication

Cited by: 4
Authors
Manghi, Paolo [1]
Atzori, Claudio [1]
De Bonis, Michele [1]
Bardi, Alessia [1]
Affiliations
[1] CNR, Ist Sci & Tecnol Informaz, Pisa, Italy
Funding
European Union Horizon 2020
Keywords
Deduplication; Information graphs; Big data; Scholarly communication; Scalability; Implementation;
DOI
10.1108/DTA-09-2019-0163
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Purpose: Several online services offer access to information from "big research graphs" (e.g. Google Scholar, OpenAIRE, Microsoft Academic Graph), which correlate scholarly/scientific communication entities such as publications, authors, datasets, organizations, projects and funders. Depending on the target users, access can range from searching and browsing content to consuming statistics for monitoring and providing feedback. Such graphs are populated over time as aggregations of multiple sources and therefore suffer from major entity-duplication problems. Although deduplication of graphs is a known and current problem, existing solutions are dedicated to specific scenarios, operate on flat collections or address local topology-driven challenges, and therefore cannot be re-used in other contexts.
Design/methodology/approach: This work presents GDup, an integrated, scalable, general-purpose system that can be customized to address deduplication over arbitrarily large information graphs. The paper presents its high-level architecture and its implementation as a service used within the OpenAIRE infrastructure system, and reports figures from real-case experiments.
Findings: GDup provides the functionalities required to deliver a fully-fledged entity deduplication workflow over a generic input graph. The system offers out-of-the-box Ground Truth management, acquisition of feedback from data curators, and algorithms for identifying and merging duplicates, to obtain a disambiguated output graph.
Originality/value: To the authors' knowledge, GDup is the only system in the literature that offers an integrated and general-purpose solution for the deduplication of graphs while targeting big data scalability issues. GDup is today one of the key modules of the OpenAIRE infrastructure production system, which monitors Open Science trends on behalf of the European Commission, national funders and institutions.
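As an illustration only, the "identifying and merging duplicates" step named in the abstract can be sketched generically in Python. This is not GDup's implementation: the function names, the input shape (a dict of id to title), the 3-character blocking key, and the 0.9 similarity threshold are all assumptions chosen for the example. The sketch groups candidate records into blocks, compares titles pairwise within each block, and collects matches into groups with a union-find structure.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize(title):
    """Lowercase the title and replace non-alphanumerics with single spaces."""
    cleaned = "".join(c if c.isalnum() else " " for c in title.lower())
    return " ".join(cleaned.split())


def find_duplicate_groups(records, threshold=0.9):
    """Group record ids whose normalized titles are near-identical.

    records: dict mapping id -> title (hypothetical input shape).
    threshold: assumed similarity cut-off, not a value from the paper.
    """
    parent = {rid: rid for rid in records}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Blocking: only compare records that share the first 3 normalized characters.
    blocks = defaultdict(list)
    for rid, title in records.items():
        blocks[normalize(title)[:3]].append(rid)

    # Pairwise matching inside each block.
    for ids in blocks.values():
        for a, b in combinations(ids, 2):
            sim = SequenceMatcher(
                None, normalize(records[a]), normalize(records[b])
            ).ratio()
            if sim >= threshold:
                parent[find(a)] = find(b)

    # Collect the groups of equivalent ids (singletons are unique records).
    groups = defaultdict(set)
    for rid in records:
        groups[find(rid)].add(rid)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])
```

Each returned group stands for one real-world entity; a merge step would then pick or build a representative record per group. Blocking keeps the number of pairwise comparisons manageable, which is the usual concern when such a procedure is applied to large collections.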
Pages: 409-435
Page count: 27