A new content-defined chunking algorithm for data deduplication in cloud storage

Cited by: 36

Authors:

Widodo, Ryan N. S. [1]

Lim, Hyotaek [2]

Atiquzzaman, Mohammed [3]

Affiliations:

[1] Dongseo Univ, Dept Ubiquitous IT, Busan 617716, South Korea

[2] Dongseo Univ, Div Comp Engn, Busan 617716, South Korea

[3] Univ Oklahoma, Sch Comp Sci, Norman, OK 73019 USA

Source:

FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2017 / Vol. 71

Funding:

National Research Foundation of Singapore;

Keywords:

Data deduplication; Cloud storage; Content-defined chunking; Hash-less chunking; Asymmetric window;

DOI:

10.1016/j.future.2017.02.013

Chinese Library Classification:

TP301 [Theory, Methods];

Discipline classification code:

081202;

Abstract:

Chunking is the process of splitting a file into smaller files called chunks. In some applications, such as remote data compression, data synchronization, and data deduplication, chunking is important because it determines the duplicate detection performance of the system. Content-defined chunking (CDC) is a method of splitting files into variable-length chunks, where the cut points are defined by internal features of the files. Unlike fixed-length chunks, variable-length chunks are more resistant to byte shifting. CDC thus increases the probability of finding duplicate chunks within a file and between files. However, CDC algorithms require additional computation to find the cut points, which can be computationally expensive for some applications. In our previous work (Widodo et al., 2016), the hash-based CDC algorithm took more processing time than the other processes in the deduplication system. This paper proposes a high-throughput hash-less chunking method called Rapid Asymmetric Maximum (RAM). Instead of using hashes, RAM uses byte values to declare the cut points. The algorithm uses a fixed-size window and a variable-sized window to find a maximum-valued byte, which is the cut point. The maximum-valued byte is included in the chunk and located at the boundary of the chunk. This configuration allows RAM to do fewer comparisons while retaining the CDC property. We compared RAM with existing hash-based and hash-less deduplication systems. The experimental results show that our proposed algorithm has higher throughput and more bytes saved per second than other chunking algorithms. (C) 2017 Elsevier B.V. All rights reserved.
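The cut-point rule described in the abstract can be illustrated with a minimal sketch. It is one plausible reading of the abstract, not the authors' implementation: it assumes the fixed-size window sits at the start of each chunk, that the variable-sized window ends at the first later byte whose value is at least the maximum found in the fixed window, and that no maximum chunk size is enforced. The function name `ram_chunks` and the `window` parameter are hypothetical.

```python
def ram_chunks(data: bytes, window: int = 4) -> list[bytes]:
    """Split data with a RAM-style rule (illustrative sketch, not the paper's code)."""
    chunks = []
    start = 0
    n = len(data)
    while start < n:
        fixed_end = min(start + window, n)
        # Largest byte value inside the fixed-size window at the chunk start.
        max_val = max(data[start:fixed_end])
        cut = n  # if no qualifying byte is found, the rest becomes the last chunk
        # Variable-sized window: grow until a byte is at least as large as max_val.
        for i in range(fixed_end, n):
            if data[i] >= max_val:
                cut = i + 1  # the maximum-valued byte ends the chunk (at its boundary)
                break
        chunks.append(data[start:cut])
        start = cut
    return chunks


sample = bytes([1, 2, 3, 1, 0, 2, 5, 4, 4, 1, 9, 2])
parts = ram_chunks(sample, window=3)
# Each byte outside the fixed window costs one comparison and no hash computation.
print([list(p) for p in parts])
```

Under these assumptions, each chunk ends at a byte that is the maximum of everything scanned so far in that chunk, so the boundaries depend only on the content and the chunks concatenate back to the original data.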

Pages: 145-156

Page count: 12

