Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings

Cited by: 0
Authors
Gyawali, Bikash [1]
Anastasiou, Lucas [1]
Knoth, Petr [1]
Affiliations
[1] Open Univ, Knowledge Media Inst, Milton Keynes, Bucks, England
Keywords
Deduplication; Scholarly Documents; Locality Sensitive Hashing; Word Embeddings; Digital Repositories;
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
Deduplication is the task of identifying near and exact duplicate data items in a collection. In this paper, we present a novel method for deduplication of scholarly documents. We develop a hybrid model which uses structural similarity (locality sensitive hashing) and meaning representation (word embeddings) of document texts to determine (near) duplicates. Our collection is a subset of multidisciplinary scholarly documents aggregated from research repositories. We identify several issues causing data inaccuracies in such collections and motivate the need for deduplication. In the absence of an existing dataset suitable for studying the deduplication of scholarly documents, we create a ground truth dataset of 100K scholarly documents and conduct a series of experiments to empirically establish optimal values for the parameters of our deduplication method. Experimental evaluation shows that our method achieves a macro F1-score of 0.90. We productionise our method as a publicly accessible web API service that deduplicates scholarly documents in real time.
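The abstract does not say how the two signals are combined, so the following is only a minimal sketch of one plausible arrangement, not the authors' implementation: MinHash-based locality sensitive hashing proposes candidate pairs, and a cosine similarity over averaged word vectors confirms them. All function names, the shingle size, the number of hash functions and bands, the 0.9 threshold, and the toy embedding table are assumptions made for illustration.

```python
import hashlib
import math
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (k is an assumed value)."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _hash(seed, value):
    """Deterministic 64-bit hash of a string, salted with an integer seed."""
    digest = hashlib.md5(f"{seed}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(shingle_set, num_perm=32):
    """MinHash signature: the minimum hash of the set under each seeded hash."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_perm)]


def lsh_candidates(signatures, num_perm=32, bands=16):
    """Banding step: documents sharing any full band become candidate pairs."""
    rows = num_perm // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def mean_vector(text, embeddings):
    """Average the vectors of known words; None if no word is in the table."""
    vecs = [embeddings[t] for t in text.lower().split() if t in embeddings]
    if not vecs:
        return None
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(len(vecs[0]))]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_duplicates(docs, embeddings, threshold=0.9):
    """Keep LSH candidate pairs whose embedding similarity meets the threshold."""
    sigs = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
    duplicates = []
    for a, b in sorted(lsh_candidates(sigs)):
        va = mean_vector(docs[a], embeddings)
        vb = mean_vector(docs[b], embeddings)
        if va is not None and vb is not None and cosine(va, vb) >= threshold:
            duplicates.append((a, b))
    return duplicates
```

In practice the toy embedding table would be replaced by pretrained word vectors, and the banding parameters and threshold would be tuned on labelled data, as the abstract describes for the authors' own parameters.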
Pages: 901 - 910
Number of pages: 10
Related papers
50 in total
  • [21] Query by humming of MIDI and audio using locality sensitive hashing
    Ryynanen, Matti
    Klapuri, Anssi
    2008 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-12, 2008, : 2249 - 2252
  • [22] In Defense of Locality-Sensitive Hashing
    Ding, Kun
    Huo, Chunlei
    Fan, Bin
    Xiang, Shiming
    Pan, Chunhong
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2018, 29 (01) : 87 - 103
  • [23] Locality sensitive hashing with bit selection
    Zhou, Wenhua
    Liu, Huawen
    Lou, Jungang
    Chen, Xin
    APPLIED INTELLIGENCE, 2022, 52 (13) : 14724 - 14738
  • [24] ENTROPY BASED LOCALITY SENSITIVE HASHING
    Wang, Qiang
    Guo, Zhiyuan
    Liu, Gang
    Guo, Jun
    2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2012, : 1045 - 1048
  • [25] Locality sensitive hashing with bit selection
    Wenhua Zhou
    Huawen Liu
    Jungang Lou
    Xin Chen
    Applied Intelligence, 2022, 52 : 14724 - 14738
  • [26] Diverse Yet Efficient Retrieval using Locality Sensitive Hashing
    Rao, Vidyadhar
    Jain, Prateek
    Jawahar, C. V.
    ICMR'16: PROCEEDINGS OF THE 2016 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2016, : 189 - 196
  • [27] Fast Redescription Mining Using Locality-Sensitive Hashing
    Karjalainen, Maiju
    Galbrun, Esther
    Miettinen, Pauli
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES: RESEARCH TRACK, PT VII, ECML PKDD 2024, 2024, 14947 : 124 - 142
  • [28] Compressing Locality Sensitive Hashing Tables
    Santoyo, Francisco
    Chavez, Edgar
    Tellez, Eric S.
    2013 MEXICAN INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE (ENC 2013), 2013, : 41 - 46
  • [29] Correlated Locality-Sensitive Hashing
    Pagh, Rasmus
    ALGORITHMS - ESA 2015, 2015, 9294
  • [30] Automatic Anonymization of Textual Documents: Detecting Sensitive Information via Word Embeddings
    Hassan, Fadi
    Sanchez, David
    Soria-Comas, Jordi
    Domingo-Ferrer, Josep
    2019 18TH IEEE INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS/13TH IEEE INTERNATIONAL CONFERENCE ON BIG DATA SCIENCE AND ENGINEERING (TRUSTCOM/BIGDATASE 2019), 2019, : 358 - 365