Similarity based deduplication with small data chunks

Cited by: 7
Authors
Aronovich, L. [1]
Asher, R. [2]
Harnik, D. [2]
Hirsch, M. [2]
Klein, S. T. [3]
Toaff, Y. [2]
Affiliations
[1] IBM Corp, Toronto, ON, Canada
[2] IBM Diligent, Tel Aviv, Israel
[3] Bar Ilan Univ, Dept Comp Sci, Ramat Gan, Israel
Keywords
Deduplication; Similarity; Small data chunks; Approximate hashing
DOI
10.1016/j.dam.2015.09.018
Chinese Library Classification number
O29 [Applied Mathematics]
Discipline classification code
070104
Abstract
Large backup and restore systems may have a petabyte or more of data in their repository. Such systems are often compressed by means of deduplication techniques that partition the input text into chunks and store recurring chunks only once. One approach is to use hashing methods to store a fingerprint for each data chunk, detecting identical chunks with a very low probability of collisions. As an alternative, it has been suggested to use similarity-based instead of identity-based searches, which allows the definition of much larger chunks. This implies that the data structure needed to store the fingerprints is much smaller, so that such a system may be more scalable than systems built on the first approach. This paper deals with an extension of the second approach to systems in which it is still preferred to use small chunks. We describe the design choices made during the development of what we call an approximate hash function, which serves as the basic tool of the newly suggested deduplication system, and report on extensive tests performed on a variety of large input files. (C) 2015 Elsevier B.V. All rights reserved.
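As an illustration of the first, identity-based approach mentioned in the abstract, the following is a minimal sketch and not the paper's method: it assumes fixed-size chunks and uses SHA-256 as the fingerprint, and the names `dedup_store` and `restore` are hypothetical. It does not implement the approximate hash function that the paper develops.

```python
import hashlib


def dedup_store(data: bytes, chunk_size: int = 8):
    """Split data into fixed-size chunks and keep each distinct chunk once.

    Assumptions for illustration only: fixed-size chunking and SHA-256
    fingerprints. Real systems typically use much larger chunks.
    """
    store = {}   # fingerprint -> chunk bytes (each distinct chunk stored once)
    recipe = []  # ordered fingerprints needed to rebuild the input
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        fp = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fp, chunk)  # keep only the first copy of a chunk
        recipe.append(fp)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data by concatenating chunks in recipe order."""
    return b"".join(store[fp] for fp in recipe)


# Five 8-byte chunks, only two of which are distinct.
data = b"ABCDEFGH" * 3 + b"12345678" + b"ABCDEFGH"
store, recipe = dedup_store(data)
```

Here the recipe holds one fingerprint per chunk, while the store holds only the distinct chunks. The fingerprint index grows with the number of distinct chunks, which is the scalability concern that motivates larger chunks in the similarity-based approach.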
Pages: 10-22
Number of pages: 13