Similarity based deduplication with small data chunks

Cited by: 7

Authors:

Aronovich, L. [1]

Asher, R. [2]

Harnik, D. [2]

Hirsch, M. [2]

Klein, S. T. [3]

Toaff, Y. [2]

Affiliations:

[1] IBM Corp, Toronto, ON, Canada

[2] IBM Diligent, Tel Aviv, Israel

[3] Bar Ilan Univ, Dept Comp Sci, Ramat Gan, Israel

Source:

DISCRETE APPLIED MATHEMATICS | 2016 / Vol. 212

Keywords:

Deduplication; Similarity; Small data chunks; Approximate hashing

DOI:

10.1016/j.dam.2015.09.018

Chinese Library Classification (CLC) number:

O29 [Applied Mathematics]

Discipline classification code:

070104

Abstract:

Large backup and restore systems may have a petabyte or more of data in their repository. Such systems are often compressed by means of deduplication techniques that partition the input text into chunks and store recurring chunks only once. One approach is to use hashing methods to store a fingerprint for each data chunk, detecting identical chunks with a very low probability of collisions. As an alternative, it has been suggested to use similarity-based instead of identity-based searches, which allows the definition of much larger chunks. This implies that the data structure needed to store the fingerprints is much smaller, so that such a system may be more scalable than systems built on the first approach. This paper deals with an extension of the second approach to systems in which it is still preferred to use small chunks. We describe the design choices made during the development of what we call an approximate hash function, which serves as the basic tool of the newly suggested deduplication system, and report on extensive tests performed on a variety of large input files. (C) 2015 Elsevier B.V. All rights reserved.
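
As an illustration of the first, identity-based approach mentioned in the abstract (not the paper's approximate hash function, which is not specified in this record), the following is a minimal sketch. Fixed-size chunking, SHA-256 as the fingerprint, the tiny chunk size, and the names `deduplicate` and `restore` are all assumptions made for the example.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 8):
    """Split data into fixed-size chunks and keep each distinct chunk once.

    Hypothetical helper for illustration only; real systems typically use
    far larger chunks and may choose chunk boundaries differently.
    """
    store = {}   # fingerprint -> chunk bytes (each distinct chunk stored once)
    recipe = []  # ordered fingerprints needed to rebuild the input
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)  # recurring chunks are not stored again
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original input from the chunk store and the recipe."""
    return b"".join(store[fp] for fp in recipe)


# Three repeats of one chunk followed by one different chunk.
data = b"ABCDEFGH" * 3 + b"12345678"
store, recipe = deduplicate(data)
```

Here the fingerprint index grows with the number of distinct chunks, which is the scalability concern that, according to the abstract, motivates similarity-based search over larger chunks.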


Pages: 10 - 22

Page count: 13

50 items in total

[1] Similarity Based Deduplication with Small Data Chunks
Aronovich, Lior
Asher, Ron
Harnik, Danny
Hirsch, Michael
Klein, Shmuel T.
Toaff, Yair
PROCEEDINGS OF THE PRAGUE STRINGOLOGY CONFERENCE 2012, 2012 : 3 - 17
[2] Optimal Partitioning of Data Chunks in Deduplication Systems
Hirsch, Michael
Ish-Shalom, Ariel
Klein, Shmuel T.
PROCEEDINGS OF THE PRAGUE STRINGOLOGY CONFERENCE 2013, 2013 : 128 - 141
[3] Optimal partitioning of data chunks in deduplication systems
Hirsch, M.
Ish-Shalom, A.
Klein, S. T.
DISCRETE APPLIED MATHEMATICS, 2016, 212 : 104 - 114
[4] Random chunks attachment strategy based secure deduplication for cloud data
Genghao L.
Ziji Z.
Xin T.
Yiteng Z.
Yuqi Z.
Tianyang Q.
Xi'an Dianzi Keji Daxue Xuebao/Journal of Xidian University, 2023, 50 (05) : 212 - 228
[5] Similarity and Locality Based Indexing for High Performance Data Deduplication
Xia, Wen
Jiang, Hong
Feng, Dan
Hua, Yu
IEEE TRANSACTIONS ON COMPUTERS, 2015, 64 (04) : 1162 - 1176
[6] Secure similarity-based cloud data deduplication in Ubiquitous city
Liu, Jinfeng
Wang, Jianfeng
Tao, Xiaoling
Shen, Jian
PERVASIVE AND MOBILE COMPUTING, 2017, 41 : 231 - 242
[7] Dynamic determination of variable sizes of chunks in a deduplication system
Hirsch, Michael
Klein, Shmuel T.
Shapira, Dana
Toaff, Yair
DISCRETE APPLIED MATHEMATICS, 2020, 274 : 81 - 91
[8] On evaluating text similarity measures for customer data deduplication
Boinski, Pawel
Sienkiewicz, Mariusz
Wrembel, Robert
Bebel, Bartosz
Andrzejewski, Witold
38TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING, SAC 2023, 2023 : 297 - 300
[9] SCAIL: Encrypted Deduplication with Segment Chunks and Index Locality
Ammons, Jaybe
Fenner, Trevor
Weston, David
2022 IEEE INTERNATIONAL CONFERENCE ON NETWORKING, ARCHITECTURE AND STORAGE (NAS), 2022 : 184 - 192
[10] GenoDedup: Similarity-Based Deduplication and Delta-Encoding for Genome Sequencing Data
Cogo, Vinicius
Paulo, Joao
Bessani, Alysson
IEEE TRANSACTIONS ON COMPUTERS, 2021, 70 (05) : 669 - 681
