Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings

Authors
Gyawali, Bikash [1]
Anastasiou, Lucas [1]
Knoth, Petr [1]
Affiliations
[1] Open Univ, Knowledge Media Inst, Milton Keynes, Bucks, England
Keywords
Deduplication; Scholarly Documents; Locality Sensitive Hashing; Word Embeddings; Digital Repositories
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Deduplication is the task of identifying near and exact duplicate data items in a collection. In this paper, we present a novel method for deduplication of scholarly documents. We develop a hybrid model that uses the structural similarity (locality sensitive hashing) and the meaning representation (word embeddings) of document texts to determine (near) duplicates. Our collection is a subset of multidisciplinary scholarly documents aggregated from research repositories. We identify several issues that cause data inaccuracies in such collections and motivate the need for deduplication. In the absence of an existing dataset suitable for studying the deduplication of scholarly documents, we create a ground truth dataset of 100K scholarly documents and conduct a series of experiments to empirically establish optimal values for the parameters of our deduplication method. Experimental evaluation shows that our method achieves a macro F1-score of 0.90. We have deployed the method in production as a publicly accessible web API service that deduplicates scholarly documents in real time.
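The hybrid idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the hashing scheme, the embedding model, or the tuned parameters, so everything below is an assumption made for illustration. MinHash with banding stands in for the locality sensitive hashing step, averaged toy word vectors stand in for the word embeddings, and the names `shingles`, `minhash`, `lsh_candidates`, `mean_vector`, `cosine`, and `find_duplicates`, as well as the values `k=3`, `num_perm=64`, `bands=16`, and `threshold=0.9`, are hypothetical.

```python
import hashlib
import math
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (k is an assumed value)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash(shingle_set, num_perm=64):
    """MinHash signature: for each seeded hash, keep the minimum hash value."""
    return [
        min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        )
        for seed in range(num_perm)
    ]


def lsh_candidates(signatures, bands=16):
    """Split each signature into bands; documents sharing a band are candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def mean_vector(text, embeddings):
    """Average the vectors of known words; None if no word is in the vocabulary."""
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vecs:
        return None
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(len(vecs[0]))]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_duplicates(docs, embeddings, threshold=0.9):
    """LSH proposes candidate pairs; embedding similarity confirms them."""
    sigs = {doc_id: minhash(shingles(text)) for doc_id, text in docs.items()}
    confirmed = []
    for a, b in sorted(lsh_candidates(sigs)):
        va, vb = mean_vector(docs[a], embeddings), mean_vector(docs[b], embeddings)
        if va is not None and vb is not None and cosine(va, vb) >= threshold:
            confirmed.append((a, b))
    return confirmed
```

In a sketch like this, the banding step keeps the number of pairwise comparisons small, and the embedding check filters candidates whose surface overlap does not reflect similar content. A real system would use pretrained embeddings and parameters tuned on labelled data, as the abstract reports the authors did on their ground truth dataset.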
Pages: 901-910
Number of pages: 10