Similarity based deduplication with small data chunks

Cited by: 7

Authors:

Aronovich, L. [1]

Asher, R. [2]

Harnik, D. [2]

Hirsch, M. [2]

Klein, S. T. [3]

Toaff, Y. [2]

Affiliations:

[1] IBM Corp, Toronto, ON, Canada

[2] IBM Diligent, Tel Aviv, Israel

[3] Bar Ilan Univ, Dept Comp Sci, Ramat Gan, Israel

Source:

DISCRETE APPLIED MATHEMATICS | 2016 / Vol. 212

Keywords:

Deduplication; Similarity; Small data chunks; Approximate hashing

DOI:

10.1016/j.dam.2015.09.018

Chinese Library Classification (CLC) number:

O29 [Applied Mathematics]

Subject classification code:

070104

Abstract:

Large backup and restore systems may have a petabyte or more of data in their repository. Such systems are often compressed by means of deduplication techniques that partition the input text into chunks and store recurring chunks only once. One approach is to use hashing methods to store a fingerprint for each data chunk, detecting identical chunks with a very low probability of collisions. As an alternative, it has been suggested to use similarity-based instead of identity-based searches, which allows the definition of much larger chunks. This implies that the data structure needed to store the fingerprints is much smaller, so that such a system may be more scalable than systems built on the first approach. This paper deals with an extension of the second approach to systems in which it is still preferred to use small chunks. We describe the design choices made during the development of what we call an approximate hash function, serving as the basic tool of the newly suggested deduplication system, and report on extensive tests performed on a variety of large input files. © 2015 Elsevier B.V. All rights reserved.
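To make the first, identity-based approach mentioned in the abstract concrete, here is a minimal sketch in Python. It is not the approximate hash function developed in the paper. Fixed-size chunking, SHA-256 as the fingerprint, the 4096-byte default chunk size, and the names `deduplicate` and `restore` are all assumptions chosen for illustration.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and store each distinct chunk once.

    Assumption: fixed-size chunks and SHA-256 fingerprints; real systems
    may use content-defined chunking and other hash functions.
    """
    store = {}   # fingerprint -> chunk bytes (each distinct chunk kept once)
    recipe = []  # ordered fingerprints needed to rebuild the input
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)  # keep only the first copy
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data from the chunk store and the recipe."""
    return b"".join(store[fingerprint] for fingerprint in recipe)
```

In this sketch the fingerprint table holds one entry per distinct chunk, so smaller chunks mean more entries to keep and search. That is the scalability concern the abstract raises for the identity-based approach.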


Pages: 10-22

Page count: 13

