A new content-defined chunking algorithm for data deduplication in cloud storage

Cited by: 36
Authors
Widodo, Ryan N. S. [1]
Lim, Hyotaek [2]
Atiquzzaman, Mohammed [3]
Affiliations
[1] Dongseo Univ, Dept Ubiquitous IT, Busan 617716, South Korea
[2] Dongseo Univ, Div Comp Engn, Busan 617716, South Korea
[3] Univ Oklahoma, Sch Comp Sci, Norman, OK 73019 USA
Funding
National Research Foundation of Singapore;
Keywords
Data deduplication; Cloud storage; Content-defined chunking; Hash-less chunking; Asymmetric window;
DOI
10.1016/j.future.2017.02.013
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Chunking is the process of splitting a file into smaller pieces called chunks. In applications such as remote data compression, data synchronization, and data deduplication, chunking is important because it determines the duplicate detection performance of the system. Content-defined chunking (CDC) splits files into variable-length chunks whose cut points are defined by internal features of the files. Unlike fixed-length chunks, variable-length chunks are more resistant to byte shifting, so CDC increases the probability of finding duplicate chunks within a file and between files. However, CDC algorithms require additional computation to find the cut points, which may be computationally expensive for some applications. In our previous work (Widodo et al., 2016), the hash-based CDC algorithm took more processing time than the other processes in the deduplication system. This paper proposes a high-throughput hash-less chunking method called Rapid Asymmetric Maximum (RAM). Instead of using hashes, RAM uses byte values to declare the cut points. The algorithm uses a fixed-size window and a variable-size window to find a maximum-valued byte, which is the cut point. The maximum-valued byte is included in the chunk and located at the boundary of the chunk. This configuration allows RAM to perform fewer comparisons while retaining the CDC property. We compared RAM with existing hash-based and hash-less deduplication systems. The experimental results show that the proposed algorithm achieves higher throughput and more bytes saved per second than the other chunking algorithms. © 2017 Elsevier B.V. All rights reserved.
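The fixed-window and variable-window idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `ram_cut_point` and `ram_chunks`, the `window` parameter, the use of `>=` for ties, and the absence of a maximum chunk size are all assumptions made for illustration.

```python
def ram_cut_point(data: bytes, start: int, window: int) -> int:
    """Return the exclusive end index of the chunk that begins at `start`.

    Hypothetical helper: `window` is the assumed fixed-window size in bytes.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    n = len(data)
    fixed_end = min(start + window, n)
    if fixed_end == n:
        # Not enough data left for a variable window: emit the remainder.
        return n
    # Fixed-size window: remember the largest byte value seen in it.
    max_val = max(data[start:fixed_end])
    # Variable-size window: grow until a byte reaches that maximum.
    # Using >= (rather than >) is an assumption of this sketch.
    for i in range(fixed_end, n):
        if data[i] >= max_val:
            # The maximum-valued byte is included as the chunk's last byte.
            return i + 1
    return n


def ram_chunks(data: bytes, window: int = 4) -> list:
    """Split `data` into variable-length chunks using ram_cut_point."""
    chunks = []
    start = 0
    while start < len(data):
        end = ram_cut_point(data, start, window)
        chunks.append(data[start:end])
        start = end
    return chunks
```

Each byte in the variable window is examined with a single comparison and no hash is computed, which is consistent with the abstract's claim of fewer comparisons. A practical chunker would likely also bound the maximum chunk size, which this sketch omits.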
Pages: 145-156
Page count: 12