Distributed Data Deduplication

Cited by: 33
Authors
Chu, Xu [1]
Ilyas, Ihab F. [1]
Koutris, Paraschos [2]
Affiliations
[1] Univ Waterloo, Waterloo, ON N2L 3G1, Canada
[2] Univ Wisconsin Madison, Madison, WI 53706 USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2016 / Vol. 9 / Issue 11
DOI
10.14778/2983200.2983203
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Data deduplication refers to the process of identifying tuples in a relation that refer to the same real-world entity. The complexity of the problem is inherently quadratic with respect to the number of tuples, since a similarity value must be computed for every pair of tuples. To avoid comparing tuple pairs that are obviously non-duplicates, blocking techniques are used to divide the tuples into blocks, and only tuples within the same block are compared. However, even with the use of blocking, data deduplication remains a costly problem for large datasets. In this paper, we show how to further speed up data deduplication by leveraging parallelism in a shared-nothing computing environment. Our main contribution is a distribution strategy, called Dis-Dedup, that minimizes the maximum workload across all worker nodes and provides strong theoretical guarantees. We demonstrate the effectiveness of our proposed strategy by performing extensive experiments on both synthetic datasets with varying block size distributions and real-world datasets.
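The blocking idea described in the abstract can be illustrated with a minimal sketch. This is not the Dis-Dedup strategy, whose details are not given in this record; it only shows how grouping tuples by a blocking key shrinks the set of candidate pairs relative to the all-pairs comparison. The sample records, the `first_letter` blocking key, and the helper names `block_by_key` and `candidate_pairs` are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def block_by_key(tuples, key_fn):
    """Group tuples into blocks that share the same blocking key."""
    blocks = defaultdict(list)
    for t in tuples:
        blocks[key_fn(t)].append(t)
    return blocks


def candidate_pairs(blocks):
    """Yield only the pairs of tuples that fall in the same block."""
    for members in blocks.values():
        yield from combinations(members, 2)


# Hypothetical (id, name) records, invented for this sketch.
records = [
    (1, "john smith"),
    (2, "jon smith"),
    (3, "mary jones"),
    (4, "marie jones"),
    (5, "alan turing"),
]


def first_letter(record):
    # Assumed blocking key: the first character of the name.
    return record[1][0]


# Without blocking, every pair is compared: n * (n - 1) / 2 pairs.
all_pairs = list(combinations(records, 2))

# With blocking, only pairs inside the same block are compared.
blocks = block_by_key(records, first_letter)
blocked_pairs = list(candidate_pairs(blocks))

print(len(all_pairs), len(blocked_pairs))
```

In a real pipeline each candidate pair would then be scored by a similarity function. A single key such as this one can miss duplicates whose keys differ, so the choice of blocking key matters.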
Pages: 864-875
Page count: 12