Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings

Cited by: 0
Authors
Gyawali, Bikash [1]
Anastasiou, Lucas [1]
Knoth, Petr [1]
Affiliations
[1] Open Univ, Knowledge Media Inst, Milton Keynes, Bucks, England
Keywords
Deduplication; Scholarly Documents; Locality Sensitive Hashing; Word Embeddings; Digital Repositories;
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
Deduplication is the task of identifying near and exact duplicate data items in a collection. In this paper, we present a novel method for the deduplication of scholarly documents. We develop a hybrid model that uses the structural similarity (locality sensitive hashing) and the meaning representation (word embeddings) of document texts to determine (near) duplicates. Our collection is a subset of multidisciplinary scholarly documents aggregated from research repositories. We identify several issues that cause data inaccuracies in such collections and motivate the need for deduplication. In the absence of an existing dataset suitable for studying the deduplication of scholarly documents, we create a ground truth dataset of 100K scholarly documents and conduct a series of experiments to empirically establish optimal values for the parameters of our deduplication method. Experimental evaluation shows that our method achieves a macro F1-score of 0.90. We productionise our method as a publicly accessible web API service that serves deduplication of scholarly documents in real time.
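The hybrid idea in the abstract (a structural signal from locality sensitive hashing plus a semantic signal from word embeddings) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: MinHash over word shingles stands in for the LSH component, averaged word vectors stand in for the meaning representation, and the function names, the thresholds, the "either signal is enough" decision rule, and the tiny `TOY_EMBEDDINGS` table are all hypothetical. The abstract does not state how the two signals are combined or which parameter values were found optimal.

```python
import hashlib
import math
import re


def tokens(text):
    """Lowercased word tokens."""
    return re.findall(r"\w+", text.lower())


def shingles(text, k=3):
    """Set of k-word shingles (the whole text if it has k words or fewer)."""
    words = tokens(text)
    if len(words) <= k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_perm=64):
    """MinHash signature: for each seeded hash function, keep the minimum hash."""
    signature = []
    for seed in range(num_perm):
        hashes = (
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{item}".encode("utf-8"), digest_size=8).digest(),
                "big",
            )
            for item in items
        )
        signature.append(min(hashes, default=0))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def mean_vector(text, embeddings, dim):
    """Average the vectors of the words that have an embedding (zeros if none do)."""
    known = [embeddings[w] for w in tokens(text) if w in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def is_near_duplicate(text_a, text_b, embeddings, dim,
                      jaccard_threshold=0.5, cosine_threshold=0.9):
    """Assumed rule: flag a pair if either the structural or the semantic score is high."""
    sig_a = minhash_signature(shingles(text_a))
    sig_b = minhash_signature(shingles(text_b))
    if estimated_jaccard(sig_a, sig_b) >= jaccard_threshold:
        return True
    vec_a = mean_vector(text_a, embeddings, dim)
    vec_b = mean_vector(text_b, embeddings, dim)
    return cosine(vec_a, vec_b) >= cosine_threshold


# Hypothetical 2-dimensional embeddings, hand-written for the demo only.
TOY_EMBEDDINGS = {
    "paper": [1.0, 0.0],
    "article": [0.9, 0.1],
    "duplicate": [0.0, 1.0],
    "detection": [0.0, 1.0],
    "weather": [-1.0, 0.2],
    "forecast": [-0.8, -0.5],
}
```

In this sketch, a call such as `is_near_duplicate("paper duplicate detection", "article duplicate detection", TOY_EMBEDDINGS, dim=2)` shares no word shingles between the two texts, so the decision falls to the embedding score. A real system would use pretrained embeddings and tune both thresholds on labelled data, as the abstract describes for the authors' own parameters.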
Pages: 901 - 910
Number of pages: 10
Related papers
50 in total
  • [31] Visual Summarization of Scholarly Videos Using Word Embeddings and Keyphrase Extraction
    Zhou, Hang
    Otto, Christian
    Ewerth, Ralph
    DIGITAL LIBRARIES FOR OPEN KNOWLEDGE, TPDL 2019, 2019, 11799 : 327 - 335
  • [32] Term Extraction from Medical Documents Using Word Embeddings
    Bay, Matthias
    Bruness, Daniel
    Herold, Miriam
    Schulze, Christian
    Guckert, Michael
    Minor, Mirjam
    2020 6TH IEEE CONGRESS ON INFORMATION SCIENCE AND TECHNOLOGY (IEEE CIST'20), 2020, : 328 - 333
  • [33] LSHWE: Improving Similarity-Based Word Embedding with Locality Sensitive Hashing for Cyberbullying Detection
    Zhao, Zehua
    Gao, Min
    Luo, Fengji
    Zhang, Yi
    Xiong, Qingyu
    2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020,
  • [34] Fast Fuzzy Search for Mixed Data Using Locality Sensitive Hashing
    Lee, Kyung Mi
    Lee, Keon Myung
    PROGRESS IN MECHATRONICS AND INFORMATION TECHNOLOGY, PTS 1 AND 2, 2014, 462-463 : 321 - +
  • [35] Frequent-Itemset Mining Using Locality-Sensitive Hashing
    Bera, Debajyoti
    Pratap, Rameshwar
    COMPUTING AND COMBINATORICS, COCOON 2016, 2016, 9797 : 143 - 155
  • [36] EFFICIENT MANIFOLD LEARNING FOR SPEECH RECOGNITION USING LOCALITY SENSITIVE HASHING
    Tomar, Vikrant Singh
    Rose, Richard C.
    2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013, : 6995 - 6999
  • [37] Locality Sensitive Hashing for Satellite Images using Texture Feature Vectors
    Buaba, Ruben
    Gebril, Mohamed
    Homaifar, Abdollah
    Kihn, Eric
    Zhizhin, Mikhail
    2010 IEEE AEROSPACE CONFERENCE PROCEEDINGS, 2010,
  • [38] Ultrafast Genomic Database Search using Layered Locality Sensitive Hashing
    Chakraborty, Angana
    Bandyopadhyay, Sanghamitra
    PROCEEDINGS OF 2018 FIFTH INTERNATIONAL CONFERENCE ON EMERGING APPLICATIONS OF INFORMATION TECHNOLOGY (EAIT), 2018,
  • [39] Similar Pair Identification using Locality-Sensitive Hashing Technique
    Lee, Kyung Mi
    Lee, Keon Myung
    6TH INTERNATIONAL CONFERENCE ON SOFT COMPUTING AND INTELLIGENT SYSTEMS, AND THE 13TH INTERNATIONAL SYMPOSIUM ON ADVANCED INTELLIGENT SYSTEMS, 2012, : 2117 - 2119
  • [40] Improving the Performance of kNN in the MapReduce Framework Using Locality Sensitive Hashing
    Bagui, Sikha
    Mondal, Arup Kumar
    Bagui, Subhash
    INTERNATIONAL JOURNAL OF DISTRIBUTED SYSTEMS AND TECHNOLOGIES, 2019, 10 (04) : 1 - 16