Scalable Similarity Joins for Fast and Accurate Record Deduplication in Big Data

Cited by: 0
Authors
Rozinek, Ondrej [1]
Borkovcova, Monika [2]
Mares, Jan [1, 3]
Affiliations
[1] Department of Process Control, University of Pardubice, Studentska 95, Pardubice, 532 10, Czech Republic
[2] Department of Information Technology, University of Pardubice, Studentska 95, Pardubice, 532 10, Czech Republic
[3] Department of Mathematics, Informatics and Cybernetics, University of Chemistry and Technology Prague, Technicka 5, Prague, 166 28, Czech Republic
Source
Keywords
Engineering Village;
DOI
Not available
Chinese Library Classification number
Subject classification number
Abstract
Bipartite matchings - Data-source - Deduplication - Entity resolutions - Matchings - Q-gram filters - Record deduplication - Record linkage - Similarity join - Similarity spaces
Pages: 181-191
Related papers
50 in total
  • [31] BIGMiner: a fast and scalable distributed frequent pattern miner for big data
    Kang-Wook Chon
    Min-Soo Kim
    Cluster Computing, 2018, 21 : 1507 - 1520
  • [32] MassJoin: A MapReduce-based Method for Scalable String Similarity Joins
    Deng, Dong
    Li, Guoliang
    Hao, Shuang
    Wang, Jiannan
    Feng, Jianhua
    2014 IEEE 30TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2014, : 340 - 351
  • [33] On evaluating text similarity measures for customer data deduplication
    Boinski, Pawel
    Sienkiewicz, Mariusz
    Wrembel, Robert
    Bebel, Bartosz
    Andrzejewski, Witold
    38TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING, SAC 2023, 2023, : 297 - 300
  • [34] Characterizing the Efficiency of Data Deduplication for Big Data Storage Management
    Zhou, Ruijin
    Liu, Ming
    Li, Tao
    2013 IEEE INTERNATIONAL SYMPOSIUM ON WORKLOAD CHARACTERIZATION (IISWC 2013), 2013, : 98 - 108
  • [35] A Bloom Filter-Based Data Deduplication for Big Data
    Podder, Shrayasi
    Mukherjee, S.
    ADVANCES IN DATA AND INFORMATION SCIENCES, VOL 1, 2018, 38 : 161 - 168
  • [36] Entity deduplication in big data graphs for scholarly communication
    Manghi, Paolo
    Atzori, Claudio
    De Bonis, Michele
    Bardi, Alessia
    DATA TECHNOLOGIES AND APPLICATIONS, 2020, 54 (04) : 409 - 435
  • [37] Boafft: Distributed Deduplication for Big Data Storage in the Cloud
    Luo, Shengmei
    Zhang, Guangyan
    Wu, Chengwen
    Khan, Samee U.
    Li, Keqin
    IEEE TRANSACTIONS ON CLOUD COMPUTING, 2020, 8 (04) : 1199 - 1211
  • [38] A Scalable Data Chunk Similarity Based Compression Approach for Efficient Big Sensing Data Processing on Cloud
    Yang, Chi
    Chen, Jinjun
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2017, 29 (06) : 1144 - 1157
  • [39] Prefix Tree Indexing for Similarity Search and Similarity Joins on Genomic Data
    Rheinlaender, Astrid
    Knobloch, Martin
    Hochmuth, Nicky
    Leser, Ulf
    SCIENTIFIC AND STATISTICAL DATABASE MANAGEMENT, 2010, 6187 : 519 - 536
  • [40] Intelligent Similarity Joins for Big Data Integration
    Wang, Mian
    Nie, Tiezheng
    Shen, Derong
    Kou, Yue
    Yu, Ge
    2013 10TH WEB INFORMATION SYSTEM AND APPLICATION CONFERENCE (WISA 2013), 2013, : 383 - 388