Data Deduplication System Based on Content-Defined Chunking Using Bytes Pair Frequency Occurrence

Cited by: 6
Authors
Saeed, Ahmed Sardar M. [1]
George, Loay E. [2]
Affiliations
[1] Sulaimani Polytech Univ, Tech Coll Informat, Informat Technol, Sulaymanyah 46001, Iraq
[2] Univ Informat Technol & Commun UoITC, Baghdad 10011, Iraq
Source
SYMMETRY-BASEL | 2020 / Volume 12 / Issue 11
Keywords
data deduplication; content-defined chunking; bytes frequency-based chunking; data deduplication gain; hashing; deduplication elimination ratio
DOI
10.3390/sym12111841
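Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09
Abstract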
Emerging technologies generate large volumes of data every second, and storing and handling that much data is very challenging. Data deduplication addresses this problem: it eliminates duplicate data and stores only a single copy, reducing storage utilization and the cost of maintaining redundant data. Content-defined chunking (CDC) plays an important role in data deduplication systems because of its ability to detect high redundancy. In this paper, we focus on optimizing the deduplication system by tuning the relevant CDC factors used to identify chunk cut-points and by introducing an efficient fingerprint based on a new hash function. We propose a novel bytes frequency-based chunking (BFBC) algorithm and a new low-cost hashing function. To evaluate the efficiency of the proposed system, extensive experiments were conducted on two different datasets. In all experiments, the proposed system consistently outperformed the common CDC algorithms, achieving a better storage gain ratio and improving both chunking and hashing throughput. In practice, our experiments show that BFBC is 10 times faster than basic sliding window (BSW) and approximately three times faster than two thresholds two divisors (TTTD). The proposed triple hash function algorithm is five times faster than SHA1 and MD5 and achieves a better deduplication elimination ratio (DER) than other CDC algorithms. The symmetry of our work lies in the balance between the proposed system's performance parameters and its effect on system efficiency compared to other deduplication systems.
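The abstract describes the approach only at a high level, so the following Python sketch is an illustration of the general idea rather than the authors' BFBC algorithm. It assumes that cut-points are declared wherever one of the most frequent adjacent byte pairs occurs, subject to hypothetical minimum and maximum chunk sizes, and it uses SHA-1 from the standard library as a stand-in fingerprint instead of the paper's triple hash. The DER is computed here under the common definition of input size divided by stored size. All function names and parameter values are assumptions.

```python
import hashlib
from collections import Counter


def top_byte_pairs(data: bytes, k: int = 4) -> set:
    """Return the k most frequent adjacent byte pairs (assumed divisor set)."""
    counts = Counter(data[i:i + 2] for i in range(len(data) - 1))
    return {pair for pair, _ in counts.most_common(k)}


def chunk(data: bytes, divisors: set, min_size: int = 16, max_size: int = 256) -> list:
    """Split data at content-defined cut-points.

    A chunk ends right after the first divisor pair found once the chunk
    has reached min_size bytes; max_size forces a cut if none is found.
    """
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + max_size, len(data))
        cut = end  # fallback: forced cut at max_size or end of data
        for pos in range(start + min_size, end + 1):
            if data[pos - 2:pos] in divisors:
                cut = pos
                break
        chunks.append(data[start:cut])
        start = cut
    return chunks


def deduplicate(data: bytes, divisors: set):
    """Store each distinct chunk once; return (store, recipe of fingerprints)."""
    store = {}
    recipe = []
    for piece in chunk(data, divisors):
        fingerprint = hashlib.sha1(piece).hexdigest()  # stand-in fingerprint
        store.setdefault(fingerprint, piece)
        recipe.append(fingerprint)
    return store, recipe


def elimination_ratio(data: bytes, store: dict) -> float:
    """DER under the assumed definition: input bytes / stored bytes."""
    return len(data) / sum(len(piece) for piece in store.values())


if __name__ == "__main__":
    block = b"the quick brown fox jumps over the lazy dog. " * 20
    sample = block + b"an inserted sentence. " + block
    divisors = top_byte_pairs(sample)
    store, recipe = deduplicate(sample, divisors)
    restored = b"".join(store[fp] for fp in recipe)
    print("lossless:", restored == sample)
    print("DER:", round(elimination_ratio(sample, store), 2))
```

Because the cut-points depend on content rather than on fixed offsets, an insertion in the middle of the input shifts only the chunks near it, and the repeated regions still map to fingerprints already in the store.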
Pages: 1 / 21
Number of pages: 21