Distributed Data Deduplication

Cited by: 33

Authors:

Chu, Xu [1]

Ilyas, Ihab F. [1]

Koutris, Paraschos [2]

Affiliations:

[1] Univ Waterloo, Waterloo, ON N2L 3G1, Canada

[2] Univ Wisconsin Madison, Madison, WI 53706 USA

Source:

PROCEEDINGS OF THE VLDB ENDOWMENT | 2016, Vol. 9, No. 11

Keywords:

DOI:

10.14778/2983200.2983203

Chinese Library Classification:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

Data deduplication refers to the process of identifying tuples in a relation that refer to the same real-world entity. The complexity of the problem is inherently quadratic with respect to the number of tuples, since a similarity value must be computed for every pair of tuples. To avoid comparing tuple pairs that are obviously non-duplicates, blocking techniques are used to divide the tuples into blocks, and only tuples within the same block are compared. However, even with the use of blocking, data deduplication remains a costly problem for large datasets. In this paper, we show how to further speed up data deduplication by leveraging parallelism in a shared-nothing computing environment. Our main contribution is a distribution strategy, called Dis-Dedup, that minimizes the maximum workload across all worker nodes and provides strong theoretical guarantees. We demonstrate the effectiveness of our proposed strategy by performing extensive experiments on both synthetic datasets with varying block size distributions and real-world datasets.
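The blocking idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the paper's Dis-Dedup strategy and says nothing about how work is distributed across nodes; it only shows, on a single machine, how grouping tuples by a blocking key shrinks the set of pairs that need a similarity computation. The sample records, the choice of city as the blocking key, and the function names `block_by_key` and `candidate_pairs` are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def block_by_key(tuples, key_fn):
    """Group tuples into blocks keyed by a (hypothetical) blocking function."""
    blocks = defaultdict(list)
    for t in tuples:
        blocks[key_fn(t)].append(t)
    return blocks


def candidate_pairs(blocks):
    """Yield only the pairs of tuples that share a block."""
    for members in blocks.values():
        yield from combinations(members, 2)


# Illustrative records: (id, name, city). The data is made up.
records = [
    ("r1", "Jon Smith", "Waterloo"),
    ("r2", "John Smith", "Waterloo"),
    ("r3", "Ann Lee", "Madison"),
    ("r4", "Anne Lee", "Madison"),
    ("r5", "Bob Ray", "Toronto"),
]

# Assumed blocking key: the city field.
blocks = block_by_key(records, key_fn=lambda r: r[2])
pairs = list(candidate_pairs(blocks))

# Without blocking, every pair is compared: n * (n - 1) / 2.
all_pairs = len(records) * (len(records) - 1) // 2

# Comparisons needed inside each block, i.e. the per-block workload.
block_workload = {k: len(v) * (len(v) - 1) // 2 for k, v in blocks.items()}

print(f"pairs without blocking: {all_pairs}")
print(f"pairs with blocking:    {len(pairs)}")
print(f"per-block workload:     {block_workload}")
```

The per-block workload grows quadratically with block size, so a few large blocks can dominate the total cost. That skew is what makes balancing the load across workers non-trivial.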


Pages: 864-875

Page count: 12

