Data Deduplication System Based on Content-Defined Chunking Using Bytes Pair Frequency Occurrence

Cited by: 6
Authors
Saeed, Ahmed Sardar M. [1 ]
George, Loay E. [2 ]
Affiliations
[1] Sulaimani Polytech Univ, Tech Coll Informat, Informat Technol, Sulaymanyah 46001, Iraq
[2] Univ Informat Technol & Commun UoITC, Baghdad 10011, Iraq
Source
SYMMETRY-BASEL | 2020 / Volume 12 / Issue 11
Keywords
data deduplication; content-defined chunking; bytes frequency-based chunking; data deduplication gain; hashing; deduplication elimination ratio;
DOI
10.3390/sym12111841
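Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code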
07 ; 0710 ; 09 ;
Abstract
Emerging technologies generate enormous volumes of data every second, and storing and handling data at this scale is challenging. Data deduplication is a solution to this problem. It is a new technique that eliminates duplicate data and stores only a single copy, reducing storage utilization and the cost of maintaining redundant data. Content-defined chunking (CDC) plays an important role in data deduplication systems because of its ability to detect high redundancy. In this paper, we focus on optimizing the deduplication system by tuning the relevant CDC factors used to identify chunk cut-points and by introducing an efficient fingerprint based on a new hash function. We propose a novel bytes frequency-based chunking (BFBC) algorithm and a new low-cost hashing function. To evaluate the efficiency of the proposed system, we conducted extensive experiments on two different datasets. In all experiments, the proposed system consistently outperformed the common CDC algorithms, achieving a better storage gain ratio and higher chunking and hashing throughput. In practice, our experiments show that BFBC is 10 times faster than basic sliding window (BSW) and approximately three times faster than two thresholds two divisors (TTTD). The proposed triple hash function algorithm is five times faster than SHA1 and MD5 and achieves a better deduplication elimination ratio (DER) than other CDC algorithms. The symmetry of our work lies in the balance between the performance parameters of the proposed system and how they are reflected in system efficiency compared with other deduplication systems.
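The abstract does not spell out the BFBC algorithm, so the following Python sketch only illustrates the general idea under stated assumptions: the most frequent adjacent byte pairs in the input are used as cut-point divisors, and chunk lengths are bounded by a minimum and a maximum size. The function names, the parameters `k`, `min_size`, and `max_size`, and the use of SHA-1 as a stand-in fingerprint (the paper's triple hash is not described here) are all hypothetical. The DER is computed as input bytes divided by stored bytes, which is the usual definition and is assumed rather than taken from this record.

```python
from collections import Counter
import hashlib


def top_pairs(data: bytes, k: int = 8) -> set:
    """Return the k most frequent adjacent byte pairs (assumed divisor set)."""
    counts = Counter(data[i:i + 2] for i in range(len(data) - 1))
    return {pair for pair, _ in counts.most_common(k)}


def chunk(data: bytes, divisors: set, min_size: int = 16, max_size: int = 256) -> list:
    """Cut after the first divisor pair found at or beyond min_size;
    force a cut at max_size when no divisor pair is found.
    Assumes 2 <= min_size <= max_size."""
    chunks = []
    start = 0
    n = len(data)
    while start < n:
        end = min(start + max_size, n)
        cut = end  # fallback: maximum-size (or final) chunk
        for i in range(start + min_size, end + 1):
            # data[i-2:i] is the byte pair that would end the chunk at i
            if data[i - 2:i] in divisors:
                cut = i
                break
        chunks.append(data[start:cut])
        start = cut
    return chunks


def deduplicate(chunks: list) -> dict:
    """Keep one copy of each distinct chunk, keyed by its fingerprint.
    SHA-1 is a stand-in here, not the paper's hash function."""
    store = {}
    for c in chunks:
        store.setdefault(hashlib.sha1(c).digest(), c)
    return store


def der(chunks: list, store: dict) -> float:
    """Deduplication elimination ratio: input bytes / stored bytes (assumed definition)."""
    total = sum(len(c) for c in chunks)
    stored = sum(len(c) for c in store.values())
    return total / stored
```

Because cut-points depend on content rather than on fixed offsets, inserting a few bytes in the middle of a file tends to shift only the neighbouring chunks, which is why CDC schemes can still detect redundancy after small edits.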
Pages: 1-21
Page count: 21