Distributed Data Deduplication

Cited by: 33
Authors
Chu, Xu [1]
Ilyas, Ihab F. [1]
Koutris, Paraschos [2]
Affiliations
[1] Univ Waterloo, Waterloo, ON N2L 3G1, Canada
[2] Univ Wisconsin Madison, Madison, WI 53706 USA
Source
Proceedings of the VLDB Endowment | 2016, Vol. 9, No. 11
DOI
10.14778/2983200.2983203
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Data deduplication refers to the process of identifying tuples in a relation that refer to the same real-world entity. The complexity of the problem is inherently quadratic with respect to the number of tuples, since a similarity value must be computed for every pair of tuples. To avoid comparing tuple pairs that are obviously non-duplicates, blocking techniques are used to divide the tuples into blocks, and only tuples within the same block are compared. However, even with the use of blocking, data deduplication remains a costly problem for large datasets. In this paper, we show how to further speed up data deduplication by leveraging parallelism in a shared-nothing computing environment. Our main contribution is a distribution strategy, called Dis-Dedup, that minimizes the maximum workload across all worker nodes and provides strong theoretical guarantees. We demonstrate the effectiveness of our proposed strategy by performing extensive experiments on both synthetic datasets with varying block-size distributions and real-world datasets.
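The blocking idea described in the abstract can be illustrated with a minimal single-machine sketch. This is not the paper's Dis-Dedup strategy; it only shows how grouping tuples by a blocking key shrinks the set of compared pairs relative to the quadratic all-pairs baseline. The sample records, the `last_name` blocking key, and the helper names `block_by_key` and `candidate_pairs` are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def block_by_key(tuples, key_fn):
    """Group tuples into blocks that share the same blocking key."""
    blocks = defaultdict(list)
    for t in tuples:
        blocks[key_fn(t)].append(t)
    return blocks


def candidate_pairs(blocks):
    """Yield only the pairs of tuples that fall inside the same block."""
    for members in blocks.values():
        yield from combinations(members, 2)


# Hypothetical relation of (id, name) tuples; not data from the paper.
records = [
    (1, "Jon Smith"),
    (2, "John Smith"),
    (3, "Jane Doe"),
    (4, "J. Doe"),
    (5, "Alice Wong"),
]


def last_name(record):
    """Hypothetical blocking key: the lowercased last token of the name."""
    return record[1].split()[-1].lower()


blocks = block_by_key(records, last_name)
pairs = list(candidate_pairs(blocks))

# Number of comparisons without blocking: n * (n - 1) / 2.
all_pairs = len(records) * (len(records) - 1) // 2

print(f"pairs compared with blocking: {len(pairs)}")
print(f"pairs compared without blocking: {all_pairs}")
```

In this toy input, blocking leaves only the pairs inside the "smith" and "doe" blocks to compare. A real pipeline would then apply a similarity function to each candidate pair, and skewed block sizes are what make distributing that work evenly across workers nontrivial.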
Pages: 864-875
Page count: 12