Similarity based deduplication with small data chunks

Cited by: 7
Authors
Aronovich, L. [1]
Asher, R. [2]
Harnik, D. [2]
Hirsch, M. [2]
Klein, S. T. [3]
Toaff, Y. [2]
Affiliations
[1] IBM Corp, Toronto, ON, Canada
[2] IBM Diligent, Tel Aviv, Israel
[3] Bar Ilan Univ, Dept Comp Sci, Ramat Gan, Israel
Keywords
Deduplication; Similarity; Small data chunks; Approximate hashing
DOI
10.1016/j.dam.2015.09.018
Chinese Library Classification
O29 [Applied Mathematics]
Discipline classification code
070104
Abstract
Large backup and restore systems may have a petabyte or more of data in their repository. Such systems are often compressed by means of deduplication techniques that partition the input text into chunks and store recurring chunks only once. One approach is to use hashing methods to store a fingerprint for each data chunk, detecting identical chunks with a very low probability of collisions. As an alternative, it has been suggested to use similarity-based instead of identity-based searches, which allows the definition of much larger chunks. This implies that the data structure needed to store the fingerprints is much smaller, so that such a system may be more scalable than systems built on the first approach. This paper deals with an extension of the second approach to systems in which it is still preferred to use small chunks. We describe the design choices made during the development of what we call an approximate hash function, which serves as the basic tool of the newly suggested deduplication system, and report on extensive tests performed on a variety of large input files. © 2015 Elsevier B.V. All rights reserved.
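For orientation, the first (identity-based) approach mentioned in the abstract can be sketched as follows. This is a minimal illustration and not the approximate hash function developed in the paper: the fixed chunk size, the use of SHA-256 as the fingerprint, and the names deduplicate and restore are all assumptions made for the example.

```python
import hashlib

def deduplicate(data: bytes, chunk_size: int = 8):
    """Split data into fixed-size chunks and store each distinct chunk once.

    Assumption: fixed-size chunking and SHA-256 fingerprints are used here
    only for illustration; real systems often use content-defined chunking.
    """
    store = {}   # fingerprint -> chunk bytes (each distinct chunk kept once)
    recipe = []  # ordered fingerprints needed to rebuild the original input
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        fingerprint = hashlib.sha256(chunk).digest()
        store.setdefault(fingerprint, chunk)  # keep only the first copy
        recipe.append(fingerprint)
    return store, recipe

def restore(store, recipe) -> bytes:
    """Rebuild the original input from the chunk store and the recipe."""
    return b"".join(store[fp] for fp in recipe)

# Five 8-byte chunks, of which only two are distinct.
data = b"ABCDEFGH" * 3 + b"12345678" + b"ABCDEFGH"
store, recipe = deduplicate(data)
print(len(recipe), len(store))  # prints "5 2"
```

The fingerprint table grows with the number of distinct chunks, which is why small chunks make the index large; the similarity-based approach described in the abstract is motivated by keeping that structure small.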
Pages: 10-22
Number of pages: 13
Related papers
50 in total
  • [1] Similarity Based Deduplication with Small Data Chunks
    Aronovich, Lior
    Asher, Ron
    Harnik, Danny
    Hirsch, Michael
    Klein, Shmuel T.
    Toaff, Yair
    PROCEEDINGS OF THE PRAGUE STRINGOLOGY CONFERENCE 2012, 2012, : 3 - 17
  • [2] Optimal Partitioning of Data Chunks in Deduplication Systems
    Hirsch, Michael
    Ish-Shalom, Ariel
    Klein, Shmuel T.
    PROCEEDINGS OF THE PRAGUE STRINGOLOGY CONFERENCE 2013, 2013, : 128 - 141
  • [3] Optimal partitioning of data chunks in deduplication systems
    Hirsch, M.
    Ish-Shalom, A.
    Klein, S. T.
    DISCRETE APPLIED MATHEMATICS, 2016, 212 : 104 - 114
  • [4] Random chunks attachment strategy based secure deduplication for cloud data
    Genghao L.
    Ziji Z.
    Xin T.
    Yiteng Z.
    Yuqi Z.
    Tianyang Q.
    Xi'an Dianzi Keji Daxue Xuebao/Journal of Xidian University, 2023, 50 (05): : 212 - 228
  • [5] Similarity and Locality Based Indexing for High Performance Data Deduplication
    Xia, Wen
    Jiang, Hong
    Feng, Dan
    Hua, Yu
    IEEE TRANSACTIONS ON COMPUTERS, 2015, 64 (04) : 1162 - 1176
  • [6] Secure similarity-based cloud data deduplication in Ubiquitous city
    Liu, Jinfeng
    Wang, Jianfeng
    Tao, Xiaoling
    Shen, Jian
    PERVASIVE AND MOBILE COMPUTING, 2017, 41 : 231 - 242
  • [7] Dynamic determination of variable sizes of chunks in a deduplication system
    Hirsch, Michael
    Klein, Shmuel T.
    Shapira, Dana
    Toaff, Yair
    DISCRETE APPLIED MATHEMATICS, 2020, 274 : 81 - 91
  • [8] On evaluating text similarity measures for customer data deduplication
    Boinski, Pawel
    Sienkiewicz, Mariusz
    Wrembel, Robert
    Bebel, Bartosz
    Andrzejewski, Witold
    38TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING, SAC 2023, 2023, : 297 - 300
  • [9] SCAIL: Encrypted Deduplication with Segment Chunks and Index Locality
    Ammons, Jaybe
    Fenner, Trevor
    Weston, David
    2022 IEEE INTERNATIONAL CONFERENCE ON NETWORKING, ARCHITECTURE AND STORAGE (NAS), 2022, : 184 - 192
  • [10] GenoDedup: Similarity-Based Deduplication and Delta-Encoding for Genome Sequencing Data
    Cogo, Vinicius
    Paulo, Joao
    Bessani, Alysson
    IEEE TRANSACTIONS ON COMPUTERS, 2021, 70 (05) : 669 - 681