Similarity based deduplication with small data chunks

Cited by: 7
Authors
Aronovich, L. [1]
Asher, R. [2]
Harnik, D. [2]
Hirsch, M. [2]
Klein, S. T. [3]
Toaff, Y. [2]
Affiliations
[1] IBM Corp, Toronto, ON, Canada
[2] IBM Diligent, Tel Aviv, Israel
[3] Bar Ilan Univ, Dept Comp Sci, Ramat Gan, Israel
Keywords
Deduplication; Similarity; Small data chunks; Approximate hashing
DOI
10.1016/j.dam.2015.09.018
Chinese Library Classification
O29 [Applied Mathematics]
Discipline classification code
070104
Abstract
Large backup and restore systems may hold a petabyte or more of data in their repositories. Their data is often compressed by means of deduplication techniques that partition the input text into chunks and store recurring chunks only once. One approach uses hashing methods to store a fingerprint for each data chunk, detecting identical chunks with a very low probability of collision. As an alternative, it has been suggested to use similarity-based instead of identity-based searches, which allows the definition of much larger chunks. This implies that the data structure needed to store the fingerprints is much smaller, so that such a system may be more scalable than systems built on the first approach. This paper deals with an extension of the second approach to systems in which it is still preferred to use small chunks. We describe the design choices made during the development of what we call an approximate hash function, which serves as the basic tool of the newly suggested deduplication system, and report on extensive tests performed on a variety of large input files. © 2015 Elsevier B.V. All rights reserved.
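For orientation, the first approach mentioned in the abstract (identity-based deduplication with per-chunk fingerprints) can be sketched roughly as follows. This is only an illustrative sketch, not the paper's approximate hash function: the fixed chunk size, the choice of SHA-256 as the fingerprint, and the names `deduplicate` and `restore` are all assumptions made for the example.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 8):
    """Split data into fixed-size chunks and store each distinct chunk once.

    Assumptions for illustration: fixed-size chunking and SHA-256
    fingerprints; real systems often use content-defined chunk boundaries.
    """
    store = {}   # fingerprint -> chunk bytes (each distinct chunk kept once)
    recipe = []  # ordered fingerprints needed to rebuild the input
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)  # recurring chunks are not stored again
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data from the chunk store and the recipe."""
    return b"".join(store[fingerprint] for fingerprint in recipe)


if __name__ == "__main__":
    data = b"ABCDEFGH" * 3 + b"12345678"
    store, recipe = deduplicate(data)
    print(len(recipe), "chunks referenced,", len(store), "chunks stored")
    assert restore(store, recipe) == data
```

The fingerprint table in this sketch has one entry per distinct chunk, so it grows as chunks get smaller. That is the scalability concern that, according to the abstract, motivates similarity-based searches over larger chunks.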
Pages: 10-22
Number of pages: 13