Similarity based deduplication with small data chunks

Cited by: 7
Authors
Aronovich, L. [1]
Asher, R. [2]
Harnik, D. [2]
Hirsch, M. [2]
Klein, S. T. [3]
Toaff, Y. [2]
Affiliations
[1] IBM Corp, Toronto, ON, Canada
[2] IBM Diligent, Tel Aviv, Israel
[3] Bar Ilan Univ, Dept Comp Sci, Ramat Gan, Israel
Keywords
Deduplication; Similarity; Small data chunks; Approximate hashing
DOI
10.1016/j.dam.2015.09.018
Chinese Library Classification number
O29 [Applied Mathematics]
Discipline classification code
070104
Abstract
Large backup and restore systems may have a petabyte or more of data in their repositories. Such systems are often compressed by means of deduplication techniques that partition the input text into chunks and store recurring chunks only once. One approach is to use hashing methods to store a fingerprint for each data chunk, detecting identical chunks with a very low probability of collision. As an alternative, it has been suggested to use similarity-based instead of identity-based searches, which allows the definition of much larger chunks. This implies that the data structure needed to store the fingerprints is much smaller, so that such a system may be more scalable than systems built on the first approach. This paper deals with an extension of the second approach to systems in which it is still preferred to use small chunks. We describe the design choices made during the development of what we call an approximate hash function, serving as the basic tool of the newly suggested deduplication system, and report on extensive tests performed on a variety of large input files. © 2015 Elsevier B.V. All rights reserved.
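
As a rough illustration of the first, identity-based approach mentioned in the abstract (store a fingerprint per chunk and keep recurring chunks only once), a minimal Python sketch might look like the following. It does not reproduce the paper's approximate hash function, which the abstract does not specify. The fixed-size chunking, the choice of SHA-256 as the fingerprint, and the names `deduplicate`, `restore`, and `chunk_size` are all assumptions made for this example.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 8):
    """Split data into fixed-size chunks and store each distinct chunk once.

    Assumption: fixed-size chunking and SHA-256 fingerprints are used here
    only for illustration; they are not taken from the paper.
    """
    store = {}   # fingerprint -> chunk bytes (each distinct chunk kept once)
    recipe = []  # ordered fingerprints needed to rebuild the input
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)  # keep only the first copy
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data from the chunk store and the recipe."""
    return b"".join(store[fingerprint] for fingerprint in recipe)


# Three repeats of one 8-byte chunk followed by a different 8-byte chunk.
data = b"ABCDEFGH" * 3 + b"12345678"
store, recipe = deduplicate(data)
restored = restore(store, recipe)
```

In this sketch the fingerprint table grows with the number of distinct chunks, which is the scalability concern the abstract raises for small chunks: the smaller the chunk size, the more fingerprints must be stored and searched.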
Pages: 10-22
Number of pages: 13
Related papers
50 in total
  • [41] Secure Textual Data Deduplication Scheme Based on Data Encoding and Compression
    Miri, Ali
    Rashid, Fatema
    2019 IEEE 10TH ANNUAL INFORMATION TECHNOLOGY, ELECTRONICS AND MOBILE COMMUNICATION CONFERENCE (IEMCON), 2019, : 207 - 211
  • [42] Consensus Mechanism of Blockchain Based on PoR with Data Deduplication
    Zhou, Wei
    Wang, Hao
    Mohiuddin, Ghulam
    Chen, Dan
    Ren, Yongjun
    INTELLIGENT AUTOMATION AND SOFT COMPUTING, 2022, 34 (03): : 1473 - 1488
  • [43] A Client-based Secure Deduplication of Multimedia Data
    Li, Danping
    Yang, Chao
    Li, Chengzhou
    Jiang, Qi
    Chen, Xiaofeng
    Ma, Jianfeng
    Ren, Jian
    2017 IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC), 2017,
  • [44] Bucket Based Data Deduplication Technique for Big Data Storage System
    Kumar, Naresh
    Rawat, Rahul
    Jain, S. C.
    2016 5TH INTERNATIONAL CONFERENCE ON RELIABILITY, INFOCOM TECHNOLOGIES AND OPTIMIZATION (TRENDS AND FUTURE DIRECTIONS) (ICRITO), 2016, : 267 - 271
  • [45] Distributed Data Deduplication
    Chu, Xu
    Ilyas, Ihab F.
    Koutris, Paraschos
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2016, 9 (11): : 864 - 875
  • [46] Improving Restore Performance of Packed Datasets in Deduplication Systems via Reducing Persistent Fragmented Chunks
    Zhang, Yucheng
    Fu, Min
    Wu, Xinyun
    Wang, Fang
    Wang, Qiang
    Wang, Chunzhi
    Dong, Xinhua
    Han, Hongmu
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2020, 31 (07) : 1651 - 1664
  • [47] Hybrid Deduplication System-A Block-Level Similarity-Based Approach
    Godavari, Amdewar
    Sudhakar, Chapram
    Ramesh, T.
    IEEE SYSTEMS JOURNAL, 2021, 15 (03): : 3860 - 3870
  • [48] Energy Consumption in Periodical GSM/GPRS Transmissions of Small Data Chunks: An Experimental Study
    Tranca, Dumitru C.
    Markovic, Vera
    2015 12TH INTERNATIONAL CONFERENCE ON TELECOMMUNICATIONS IN MODERN SATELLITE, CABLE AND BROADCASTING SERVICES (TELSIKS), 2015, : 235 - 238
  • [49] Blockchain-Based Shared Data Integrity Auditing and Deduplication
    Miao, Ying
    Gai, Keke
    Zhu, Liehuang
    Choo, Kim-Kwang Raymond
    Vaidya, Jaideep
    IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2024, 21 (04) : 3688 - 3703
  • [50] A novel data routing strategy based on directories for deduplication clusters
    Wang, Lifang
    Zhang, Zhike
    Jiang, Zejun
    Cai, Xiaobin
    Peng, Chengzhang
    Xibei Gongye Daxue Xuebao/Journal of Northwestern Polytechnical University, 2014, 32 (04): : 658 - 663