Similarity based deduplication with small data chunks

Cited by: 7
Authors
Aronovich, L. [1]
Asher, R. [2]
Harnik, D. [2]
Hirsch, M. [2]
Klein, S. T. [3]
Toaff, Y. [2]
Affiliations
[1] IBM Corp, Toronto, ON, Canada
[2] IBM Diligent, Tel Aviv, Israel
[3] Bar Ilan Univ, Dept Comp Sci, Ramat Gan, Israel
Keywords
Deduplication; Similarity; Small data chunks; Approximate hashing
DOI
10.1016/j.dam.2015.09.018
Chinese Library Classification
O29 [Applied mathematics]
Discipline classification code
070104
Abstract
Large backup and restore systems may hold a petabyte or more of data in their repository. Such systems are often compressed by means of deduplication techniques, which partition the input text into chunks and store recurring chunks only once. One approach is to use hashing methods to store a fingerprint for each data chunk, detecting identical chunks with a very low probability of collisions. As an alternative, it has been suggested to use similarity-based instead of identity-based searches, which allows the definition of much larger chunks. This implies that the data structure needed to store the fingerprints is much smaller, so that such a system may be more scalable than systems built on the first approach. This paper deals with an extension of the second approach to systems in which it is still preferred to use small chunks. We describe the design choices made during the development of what we call an approximate hash function, which serves as the basic tool of the newly suggested deduplication system, and report on extensive tests performed on a variety of large input files. (C) 2015 Elsevier B.V. All rights reserved.
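As an illustration of the first, identity-based approach mentioned in the abstract, the following is a minimal sketch and not the paper's method: it does not implement the approximate hash function. It assumes fixed-size chunks and SHA-256 fingerprints; the names `deduplicate`, `restore` and `fingerprint_count`, as well as the chunk sizes used, are hypothetical choices made for this example only.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 8):
    """Split data into fixed-size chunks and store each distinct chunk once.

    Assumption: fixed-size chunking and SHA-256 fingerprints are
    illustrative choices, not taken from the paper.
    """
    store = {}   # fingerprint -> chunk bytes (one copy per distinct chunk)
    recipe = []  # ordered fingerprints needed to rebuild the input
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)  # keep only the first copy
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data from the chunk store and the recipe."""
    return b"".join(store[fp] for fp in recipe)


def fingerprint_count(repository_bytes: int, chunk_size: int) -> int:
    """Upper bound on index entries: one fingerprint per chunk."""
    return -(-repository_bytes // chunk_size)  # ceiling division


# Four copies of one chunk and one different chunk.
data = b"ABCDEFGH" * 3 + b"12345678" + b"ABCDEFGH"
store, recipe = deduplicate(data, chunk_size=8)
print(len(recipe), "chunks referenced,", len(store), "chunks stored")

# Index size for a hypothetical 10**15-byte repository at two chunk sizes.
for size in (8 * 1024, 16 * 1024 * 1024):
    print(size, "byte chunks:", fingerprint_count(10**15, size), "fingerprints")
```

The last loop shows why chunk size matters for scalability: with one fingerprint per chunk, the index shrinks in proportion to the chunk size, which is the motivation the abstract gives for similarity-based searches over larger chunks.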
Pages: 10-22
Number of pages: 13