Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings

Cited by: 0
Authors
Gyawali, Bikash [1]
Anastasiou, Lucas [1]
Knoth, Petr [1]
Affiliations
[1] Open Univ, Knowledge Media Inst, Milton Keynes, Bucks, England
Keywords
Deduplication; Scholarly Documents; Locality Sensitive Hashing; Word Embeddings; Digital Repositories
DOI
Not available
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Deduplication is the task of identifying near and exact duplicate data items in a collection. In this paper, we present a novel method for deduplication of scholarly documents. We develop a hybrid model which uses structural similarity (locality sensitive hashing) and meaning representation (word embeddings) of document texts to determine (near) duplicates. Our collection is a subset of multidisciplinary scholarly documents aggregated from research repositories. We identify several issues causing data inaccuracies in such collections and motivate the need for deduplication. In the absence of an existing dataset suitable for studying deduplication of scholarly documents, we create a ground truth dataset of 100K scholarly documents and conduct a series of experiments to empirically establish optimal values for the parameters of our deduplication method. Experimental evaluation shows that our method achieves a macro F1-score of 0.90. We productionise our method as a publicly accessible web API service that deduplicates scholarly documents in real time.
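The hybrid idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the shingle size, the number of hash functions, the banding scheme, the embedding model, or the similarity threshold, so every parameter and function name below (`shingles`, `minhash_signature`, `lsh_candidates`, `embed`, `find_duplicates`, `sim_threshold`) is an assumption chosen for illustration. MinHash with banding stands in for the locality sensitive hashing step, and averaged toy word vectors stand in for the word embeddings.

```python
import hashlib
import math
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (k is an assumed value)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(shingle_set, num_perm=64):
    """MinHash signature: for each seeded hash, the minimum hash over all shingles."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def lsh_candidates(signatures, bands=16):
    """Split signatures into bands; documents sharing any band become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def embed(text, vectors):
    """Average the vectors of known words (a simple stand-in for a real model)."""
    dim = len(next(iter(vectors.values())))
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def find_duplicates(docs, vectors, sim_threshold=0.9):
    """LSH proposes candidate pairs; embedding similarity confirms them."""
    signatures = {d: minhash_signature(shingles(t)) for d, t in docs.items()}
    duplicates = set()
    for a, b in lsh_candidates(signatures):
        if cosine(embed(docs[a], vectors), embed(docs[b], vectors)) >= sim_threshold:
            duplicates.add((a, b))
    return duplicates
```

In this sketch the LSH step keeps the comparison cheap by only pairing documents that collide in at least one band, and the embedding step acts as a second check on those pairs. A production system would replace the toy vectors with a trained embedding model and tune the thresholds on labelled data, as the abstract describes doing with its ground truth dataset.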
Pages: 901-910
Number of pages: 10
Related papers
50 in total
  • [1] Fast Video Deduplication via Locality Sensitive Hashing with Similarity Ranking
    Li, Yeguang
    Xia, Ke
    8TH INTERNATIONAL CONFERENCE ON INTERNET MULTIMEDIA COMPUTING AND SERVICE (ICIMCS2016), 2016, : 94 - 98
  • [2] LSHvec: A Vector Representation of DNA Sequences Using Locality Sensitive Hashing and FastText Word Embeddings
    Shi, Lizhen
    Chen, Bo
    12TH ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY, AND HEALTH INFORMATICS (ACM-BCB 2021), 2021,
  • [3] Locality Sensitive Hashing Using GMM
    Schmieder, Fabian
    Yang, Bin
    PATTERN RECOGNITION, GCPR 2014, 2014, 8753 : 569 - 581
  • [4] Fast distributed video deduplication via locality-sensitive hashing with similarity ranking
    Yeguang Li
    Liang Hu
    Ke Xia
    Jie Luo
    EURASIP Journal on Image and Video Processing, 2019
  • [5] Fast distributed video deduplication via locality-sensitive hashing with similarity ranking
    Li, Yeguang
    Hu, Liang
    Xia, Ke
    Luo, Jie
    EURASIP JOURNAL ON IMAGE AND VIDEO PROCESSING, 2019, 2019 (1)
  • [6] Locality Sensitive Hashing for Scalable Structural Classification and Clustering of Web Documents
    Hachenberg, Christian
    Gottron, Thomas
    PROCEEDINGS OF THE 22ND ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM'13), 2013, : 359 - 368
  • [7] Raga Identification Using Locality Sensitive Hashing
    Padmasundari, G.
    Murthy, Hema A.
    2017 TWENTY-THIRD NATIONAL CONFERENCE ON COMMUNICATIONS (NCC), 2017,
  • [8] Dynamic Whitelisting Using Locality Sensitive Hashing
    Pryde, Jayson
    Angeles, Nestle
    Carinan, Sheryl Kareen
    TRENDS AND APPLICATIONS IN KNOWLEDGE DISCOVERY AND DATA MINING: PAKDD 2018 WORKSHOPS, 2018, 11154 : 181 - 185
  • [9] A Fast Word Retrieval Technique Based on Kernelized Locality Sensitive Hashing
    Mondal, Tanmoy
    Ragot, Nicolas
    Ramel, Jean-Yves
    Pal, Umapada
    2013 12TH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR), 2013, : 1195 - 1199
  • [10] ON THE DISTORTION OF LOCALITY SENSITIVE HASHING
    Chierichetti, Flavio
    Kumar, Ravi
    Panconesi, Alessandro
    Terolli, Erisa
    SIAM JOURNAL ON COMPUTING, 2019, 48 (02) : 350 - 372