A new content-defined chunking algorithm for data deduplication in cloud storage

Cited by: 36
Authors
Widodo, Ryan N. S. [1]
Lim, Hyotaek [2]
Atiquzzaman, Mohammed [3]
Affiliations
[1] Dongseo Univ, Dept Ubiquitous IT, Busan 617716, South Korea
[2] Dongseo Univ, Div Comp Engn, Busan 617716, South Korea
[3] Univ Oklahoma, Sch Comp Sci, Norman, OK 73019 USA
Funding
National Research Foundation of Singapore;
Keywords
Data deduplication; Cloud storage; Content-defined chunking; Hash-less chunking; Asymmetric window;
DOI
10.1016/j.future.2017.02.013
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Chunking is the process of splitting a file into smaller pieces called chunks. In some applications, such as remote data compression, data synchronization, and data deduplication, chunking is important because it determines the duplicate detection performance of the system. Content-defined chunking (CDC) splits files into variable-length chunks, where the cut points are defined by internal features of the files. Unlike fixed-length chunks, variable-length chunks are more resistant to byte shifting, so CDC increases the probability of finding duplicate chunks within a file and between files. However, CDC algorithms require additional computation to find the cut points, which might be computationally expensive for some applications. In our previous work (Widodo et al., 2016), the hash-based CDC algorithm used in the system took more processing time than the other processes in the deduplication system. This paper proposes a high-throughput hash-less chunking method called Rapid Asymmetric Maximum (RAM). Instead of using hashes, RAM uses byte values to declare the cut points. The algorithm uses a fixed-size window and a variable-size window to find a maximum-valued byte, which is the cut point. The maximum-valued byte is included in the chunk and located at the boundary of the chunk. This configuration allows RAM to do fewer comparisons while retaining the CDC property. We compared RAM with existing hash-based and hash-less deduplication systems. The experimental results show that our proposed algorithm has higher throughput and more bytes saved per second than other chunking algorithms. (C) 2017 Elsevier B.V. All rights reserved.
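The cut-point rule described in the abstract can be illustrated with a minimal Python sketch. This is an interpretation of the abstract, not the authors' implementation: the function name `ram_chunks`, the `window` parameter and its default, and the use of `>=` when comparing against the window maximum are all assumptions made for illustration. Each chunk begins with a fixed-size window whose maximum byte value is recorded; the scan then continues through a variable-size region until a byte at least as large as that maximum appears, and that byte closes the chunk.

```python
from typing import List


def ram_chunks(data: bytes, window: int = 4) -> List[bytes]:
    """Split data with a RAM-style maximum-byte rule (illustrative sketch).

    `window` is a hypothetical name for the fixed-size window length.
    """
    chunks: List[bytes] = []
    start = 0
    n = len(data)
    while start < n:
        # Fixed-size window at the start of the chunk: record its maximum byte.
        fixed_end = min(start + window, n)
        max_val = max(data[start:fixed_end])
        # Variable-size window: scan until a byte >= that maximum is found.
        cut = n  # if no such byte exists, the rest of the data is one chunk
        for i in range(fixed_end, n):
            if data[i] >= max_val:
                cut = i + 1  # the maximum-valued byte ends the chunk (inclusive)
                break
        chunks.append(data[start:cut])
        start = cut
    return chunks


sample = bytes([1, 2, 3, 9, 1, 1, 5, 2, 2, 7, 0])
parts = ram_chunks(sample, window=2)
```

In this sketch only one comparison is made per byte in the variable-size region, which is the kind of saving the abstract attributes to the asymmetric window layout. A practical chunker would typically also bound the chunk size, which the abstract does not describe and the sketch omits.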
Pages: 145 - 156
Page count: 12
Related papers
50 in total
  • [1] A smart hybrid content-defined chunking algorithm for data deduplication in cloud storage
    Ellappan, Manogar
    Murugappan, Abirami
    SOFT COMPUTING, 2023, 28 (15-16) : 9037 - 9052
  • [2] SeqCDC: Hashless Content-Defined Chunking for Data Deduplication
    Udayashankar, Sreeharsha
    Baba, Abdelrahman
    Al-Kiswany, Samer
    PROCEEDINGS OF THE TWENTY-FIFTH ACM INTERNATIONAL MIDDLEWARE CONFERENCE, MIDDLEWARE 2024, 2024, : 292 - 298
  • [3] The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems
    Xia, Wen
    Zou, Xiangyu
    Jiang, Hong
    Zhou, Yukun
    Liu, Chuanyi
    Feng, Dan
    Hua, Yu
    Hu, Yuchong
    Zhang, Yucheng
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2020, 31 (09) : 2017 - 2031
  • [4] Blockchain-based data deduplication using novel content-defined chunking algorithm in cloud environment
    Prakash, J. Jabin
    Ramesh, K.
    Saravanan, K.
    Prabha, G. Lakshmi
    INTERNATIONAL JOURNAL OF NETWORK MANAGEMENT, 2023,
  • [5] Blockchain-based data deduplication using novel content-defined chunking algorithm in cloud environment
    Prakash, Jabin J.
    Ramesh, K.
    Saravanan, K.
    Prabha, Lakshmi G.
    INTERNATIONAL JOURNAL OF NETWORK MANAGEMENT, 2023, 33 (06)
  • [6] Accelerating Content-Defined Chunking for Data Deduplication Based on Speculative Jump
    Jin, Xiaozhong
    Liu, Haikun
    Ye, Chencheng
    Liao, Xiaofei
    Jin, Hai
    Zhang, Yu
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2023, 34 (09) : 2568 - 2579
  • [7] FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication
    Xia, Wen
    Zhou, Yukun
    Jiang, Hong
    Feng, Dan
    Hua, Yu
    Hu, Yuchong
    Zhang, Yucheng
    Liu, Qing
    PROCEEDINGS OF USENIX ATC '16: 2016 USENIX ANNUAL TECHNICAL CONFERENCE, 2016, : 101 - 114
  • [8] UltraCDC: A Fast and Stable Content-Defined Chunking Algorithm for Deduplication-based Backup Storage Systems
    Zhou, Peng
    Wang, Zhenyu
    Xia, Wen
    Zhang, Haotong
    2022 IEEE INTERNATIONAL PERFORMANCE, COMPUTING, AND COMMUNICATIONS CONFERENCE, IPCCC, 2022,
  • [9] Implementing Content-Defined Chunking for Deduplication in Host-Managed SSDs
    Chen, Che-Min
    Shih, Yi-Chao
    Liu, Xin
    Shih, Wei-Kuan
    Chen, Tseng-Yi
    2024 IEEE THE 20TH ASIA PACIFIC CONFERENCE ON CIRCUITS AND SYSTEMS, APCCAS 2024, 2024, : 159 - 163
  • [10] Data Deduplication System Based on Content-Defined Chunking Using Bytes Pair Frequency Occurrence
    Saeed, Ahmed Sardar M.
    George, Loay E.
    SYMMETRY-BASEL, 2020, 12 (11) : 1 - 21