Data Deduplication System Based on Content-Defined Chunking Using Bytes Pair Frequency Occurrence

Cited by: 6
Authors
Saeed, Ahmed Sardar M. [1 ]
George, Loay E. [2 ]
Affiliations
[1] Sulaimani Polytech Univ, Tech Coll Informat, Informat Technol, Sulaymanyah 46001, Iraq
[2] Univ Informat Technol & Commun UoITC, Baghdad 10011, Iraq
Source
SYMMETRY-BASEL | 2020 / Vol. 12 / Issue 11
Keywords
data deduplication; content-defined chunking; bytes frequency-based chunking; data deduplication gain; hashing; deduplication elimination ratio
DOI
10.3390/sym12111841
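Chinese Library Classification (CLC) number
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract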
Every second, vast amounts of data are generated through the use of emerging technologies, and storing and handling data at this scale is very challenging. Data deduplication addresses this problem by eliminating duplicate data and storing only a single copy, which reduces storage utilization and the cost of maintaining redundant data. Content-defined chunking (CDC) plays an important role in data deduplication systems because of its ability to detect high redundancy. In this paper, we focus on optimizing the deduplication system by tuning the relevant factors in CDC that identify chunk cut-points and by introducing an efficient fingerprint based on a new hash function. We propose a novel bytes frequency-based chunking (BFBC) algorithm and a new low-cost hashing function. To evaluate the efficiency of the proposed system, extensive experiments were conducted on two different datasets. In all experiments, the proposed system consistently outperformed the common CDC algorithms, achieving a better storage gain ratio and improving both chunking and hashing throughput. In practice, our experiments show that BFBC is 10 times faster than basic sliding window (BSW) and approximately three times faster than two thresholds two divisors (TTTD). The proposed triple hash function algorithm is five times faster than SHA1 and MD5 and achieves a better deduplication elimination ratio (DER) than other CDC algorithms. The symmetry of our work lies in the balance between the performance parameters of the proposed system and their effect on system efficiency compared to other deduplication systems.
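The pipeline the abstract describes (find chunk cut-points from content, fingerprint each chunk, keep one copy per fingerprint, report a deduplication ratio) can be illustrated with a minimal sketch. This is not the authors' BFBC algorithm or their triple hash: the cut-point rule (cut after one of the most frequent adjacent byte pairs, bounded by a minimum and maximum chunk size), the parameter values, the use of SHA-1 as the fingerprint, the DER formula (input bytes divided by stored bytes), and all function names are assumptions made for illustration only.

```python
import hashlib
from collections import Counter


def top_byte_pairs(data: bytes, k: int = 8) -> set:
    """Return the k most frequent adjacent byte pairs (hypothetical divisor set)."""
    counts = Counter(data[i:i + 2] for i in range(len(data) - 1))
    return {pair for pair, _ in counts.most_common(k)}


def chunk(data: bytes, divisors: set, min_size: int = 64, max_size: int = 1024) -> list:
    """Split data into variable-size chunks.

    A chunk ends right after a divisor pair once it is at least min_size
    bytes long; if no divisor is found, it is cut at max_size (assumed rule).
    """
    chunks = []
    start, n = 0, len(data)
    while start < n:
        end = min(start + max_size, n)  # forced cut when no divisor appears
        for cut in range(start + min_size, end):
            if data[cut - 2:cut] in divisors:
                end = cut
                break
        chunks.append(data[start:end])
        start = end
    return chunks


def deduplicate(chunks: list) -> dict:
    """Keep one copy of each chunk, keyed by its SHA-1 fingerprint."""
    store = {}
    for c in chunks:
        store.setdefault(hashlib.sha1(c).hexdigest(), c)
    return store


def der(data: bytes, store: dict) -> float:
    """Assumed DER definition: input size divided by size after deduplication."""
    stored = sum(len(c) for c in store.values())
    return len(data) / stored


if __name__ == "__main__":
    sample = b"the quick brown fox jumps over the lazy dog. " * 200
    pieces = chunk(sample, top_byte_pairs(sample, k=4))
    unique = deduplicate(pieces)
    print(len(pieces), "chunks,", len(unique), "unique, DER =", der(sample, unique))
```

Because cut-points depend on content rather than fixed offsets, an insertion near the start of a file shifts only the nearby chunk boundaries, which is the property that lets CDC schemes detect redundancy that fixed-size chunking misses.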
Pages: 1 / 21
Page count: 21
Related papers
44 in total
  • [1] SeqCDC: Hashless Content-Defined Chunking for Data Deduplication
    Udayashankar, Sreeharsha
    Baba, Abdelrahman
    Al-Kiswany, Samer
    PROCEEDINGS OF THE TWENTY-FIFTH ACM INTERNATIONAL MIDDLEWARE CONFERENCE, MIDDLEWARE 2024, 2024, : 292 - 298
  • [2] Accelerating Content-Defined Chunking for Data Deduplication Based on Speculative Jump
    Jin, Xiaozhong
    Liu, Haikun
    Ye, Chencheng
    Liao, Xiaofei
    Jin, Hai
    Zhang, Yu
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2023, 34 (09) : 2568 - 2579
  • [3] The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems
    Xia, Wen
    Zou, Xiangyu
    Jiang, Hong
    Zhou, Yukun
    Liu, Chuanyi
    Feng, Dan
    Hua, Yu
    Hu, Yuchong
    Zhang, Yucheng
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2020, 31 (09) : 2017 - 2031
  • [4] A new content-defined chunking algorithm for data deduplication in cloud storage
    Widodo, Ryan N. S.
    Lim, Hyotaek
    Atiquzzaman, Mohammed
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2017, 71 : 145 - 156
  • [5] FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication
    Xia, Wen
    Zhou, Yukun
    Jiang, Hong
    Feng, Dan
    Hua, Yu
    Hu, Yuchong
    Zhang, Yucheng
    Liu, Qing
    PROCEEDINGS OF USENIX ATC '16: 2016 USENIX ANNUAL TECHNICAL CONFERENCE, 2016, : 101 - 114
  • [6] Blockchain-based data deduplication using novel content-defined chunking algorithm in cloud environment
    Prakash, J. Jabin
    Ramesh, K.
    Saravanan, K.
    Prabha, G. Lakshmi
    INTERNATIONAL JOURNAL OF NETWORK MANAGEMENT, 2023,
  • [7] Blockchain-based data deduplication using novel content-defined chunking algorithm in cloud environment
    Prakash, Jabin J.
    Ramesh, K.
    Saravanan, K.
    Prabha, Lakshmi G.
    INTERNATIONAL JOURNAL OF NETWORK MANAGEMENT, 2023, 33 (06)
  • [8] A smart hybrid content-defined chunking algorithm for data deduplication in cloud storage
    Ellappan, Manogar
    Murugappan, Abirami
    SOFT COMPUTING, 2023, 28 (15-16) : 9037 - 9052
  • [9] Implementing Content-Defined Chunking for Deduplication in Host-Managed SSDs
    Chen, Che-Min
    Shih, Yi-Chao
    Liu, Xin
    Shih, Wei-Kuan
    Chen, Tseng-Yi
    2024 IEEE THE 20TH ASIA PACIFIC CONFERENCE ON CIRCUITS AND SYSTEMS, APCCAS 2024, 2024, : 159 - 163
  • [10] SuperCDC: A Hybrid Design of High-Performance Content-Defined Chunking for Fast Deduplication
    Wan, Binzhaoshuo
    Pu, Lifeng
    Zou, Xiangyu
    Li, Shiyi
    Wang, Peng
    Xia, Wen
    2022 IEEE 40TH INTERNATIONAL CONFERENCE ON COMPUTER DESIGN (ICCD 2022), 2022, : 170 - 178