Entity deduplication in big data graphs for scholarly communication

Cited by: 4
Authors
Manghi, Paolo [1]
Atzori, Claudio [1]
De Bonis, Michele [1]
Bardi, Alessia [1]
Affiliations
[1] CNR, Ist Sci & Tecnol Informaz, Pisa, Italy
Funding
EU Horizon 2020;
Keywords
Deduplication; Information graphs; Big data; Scholarly communication; Scalability; Implementation;
DOI
10.1108/DTA-09-2019-0163
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Purpose: Several online services offer access to information from "big research graphs" (e.g. Google Scholar, OpenAIRE, Microsoft Academic Graph), which correlate scholarly/scientific communication entities such as publications, authors, datasets, organizations, projects and funders. Depending on the target users, access can range from searching and browsing content to consuming statistics for monitoring and providing feedback. Such graphs are populated over time as aggregations of multiple sources and therefore suffer from major entity-duplication problems. Although graph deduplication is a well-known and current problem, existing solutions are dedicated to specific scenarios, operate on flat collections or address local, topology-driven challenges, and therefore cannot be reused in other contexts.
Design/methodology/approach: This work presents GDup, an integrated, scalable, general-purpose system that can be customized to address deduplication over arbitrarily large information graphs. The paper presents its high-level architecture and its implementation as a service used within the OpenAIRE infrastructure system, and reports figures from real-case experiments.
Findings: GDup provides the functionalities required to deliver a fully fledged entity deduplication workflow over a generic input graph. The system offers out-of-the-box Ground Truth management, acquisition of feedback from data curators, and algorithms for identifying and merging duplicates, to obtain a disambiguated output graph.
Originality/value: To our knowledge, GDup is the only system in the literature that offers an integrated and general-purpose solution for the deduplication of graphs while targeting big data scalability issues. GDup is today one of the key modules of the OpenAIRE infrastructure production system, which monitors Open Science trends on behalf of the European Commission, national funders and institutions.
Pages: 409-435
Page count: 27
Related papers
50 in total
  • [31] De-duplicating the OpenAIRE Scholarly Communication Big Graph
    Atzori, Claudio
    Manghi, Paolo
    Bardi, Alessia
    [J]. 2018 IEEE 14TH INTERNATIONAL CONFERENCE ON E-SCIENCE (E-SCIENCE 2018), 2018, : 372 - 373
  • [32] RESEARCH BEYOND SCHOLARLY COMMUNICATION - THE BIG CHALLENGE OF SCIENTOMETRICS 2.0
    Glanzel, Wolfgang
    Chi, Pei-Shan
    [J]. 17TH INTERNATIONAL CONFERENCE ON SCIENTOMETRICS & INFORMETRICS (ISSI2019), VOL I, 2019, : 424 - 436
  • [33] A Hybrid Data Deduplication Approach in Entity Resolution Using Chromatic Correlation Clustering
    Haruna, Charles R.
    Hou, Mengshu
    Eghan, Moses J.
    Kpiebaareh, Michael Y.
    Tandoh, Lawrence
    [J]. FRONTIERS IN CYBER SECURITY, 2018, 879 : 153 - 167
  • [34] Designing Framework for Precise Service of Scholarly Big Data
    Xie, Jing
    Qian, Li
    Shi, Hongbo
    Kong, Beibei
    Hu, Jiying
    [J]. Data Analysis and Knowledge Discovery, 2019, 3 (01): : 63 - 71
  • [35] Scalable Similarity Joins for Fast and Accurate Record Deduplication in Big Data
    Rozinek, Ondrej
    Borkovcova, Monika
    Mares, Jan
    [J]. Lecture Notes in Networks and Systems, 2024, 990 LNNS : 181 - 191
  • [36] Searching for Evidence of Scientific News in Scholarly Big Data
    Ul Hoque, Md Reshad
    Bradley, Dash
    Kwan, Chiman
    Chiatti, Agnese
    Li, Jiang
    Wu, Jian
    [J]. PROCEEDINGS OF THE 10TH INTERNATIONAL CONFERENCE ON KNOWLEDGE CAPTURE (K-CAP '19), 2019, : 251 - 254
  • [37] Design Considerations for a Sustainable Scholarly Big Data Service
    Wu, Jian
    Rohatgi, Shaurya
    Angadi, Manoj K.
    Puranik, Kavya S.
    Giles, C. Lee
    [J]. ACM International Conference Proceeding Series, 2022, : 83 - 87
  • [38] A Web Service for Scholarly Big Data Information Extraction
    Williams, Kyle
    Li, Lichi
    Khabsa, Madian
    Wu, Jian
    Shih, Patrick C.
    Giles, C. Lee
    [J]. 2014 IEEE 21ST INTERNATIONAL CONFERENCE ON WEB SERVICES (ICWS 2014), 2014, : 105 - 112
  • [39] Research Paper Recommender Systems on Big Scholarly Data
    Chen, Tsung Teng
    Lee, Maria
    [J]. KNOWLEDGE MANAGEMENT AND ACQUISITION FOR INTELLIGENT SYSTEMS (PKAW 2018), 2018, 11016 : 251 - 260
  • [40] Scalable Similarity Joins for Fast and Accurate Record Deduplication in Big Data
    Rozinek, Ondrej
    Borkovcova, Monika
    Mares, Jan
    [J]. GOOD PRACTICES AND NEW PERSPECTIVES IN INFORMATION SYSTEMS AND TECHNOLOGIES, VOL 6, WORLDCIST 2024, 2024, 990 : 181 - 191