Hadoop Based Scalable Cluster Deduplication for Big Data

Cited by: 4
Authors
Liu, Qing [1]
Fu, Yinjin [1]
Ni, Guiqiang [1]
Hou, Rui [2]
Affiliations
[1] PLA Univ Sci & Technol, Coll Command Informat Syst, Nanjing, Jiangsu, Peoples R China
[2] Inst Elect Syst Engn, Beijing, Peoples R China
Keywords
data deduplication; big data; Hadoop; HBase; index management
DOI
10.1109/ICDCSW.2016.17
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The exponential growth of data poses a tremendous challenge to storage systems in data centers. Data deduplication, which detects and eliminates redundant data in a dataset, can greatly reduce the volume of data and improve the utilization of storage space. This paper presents Halodedu, a scalable and reliable cluster deduplication system built on a Hadoop-based cloud computing platform. Halodedu uses MapReduce for parallel deduplication processing and HDFS for data storage management. A local database on each node provides fast, distributed management of the chunk fingerprint index. To maintain the availability and reliability of metadata, HBase stores the metadata of backup files. Halodedu is evaluated with virtual machine images as the input dataset. Comparative experiments demonstrate that Halodedu improves deduplication speed and system scalability.
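To make the idea of a chunk fingerprint index concrete, the following is a minimal single-process sketch, not Halodedu's implementation. It assumes fixed-size chunking, SHA-1 fingerprints, and an in-memory dict standing in for the node-local index database; the abstract specifies none of these, and the names `deduplicate` and `restore` are hypothetical.

```python
import hashlib


def deduplicate(data: bytes, index: dict, chunk_size: int = 4096):
    """Split data into fixed-size chunks and store only unseen chunks.

    `index` maps a chunk fingerprint to the chunk bytes. It is an
    assumption standing in for a node-local fingerprint database.
    Returns the file recipe (ordered fingerprints) and the number of
    bytes newly stored.
    """
    recipe = []
    new_bytes = 0
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha1(chunk).hexdigest()
        if fingerprint not in index:
            # First occurrence of this chunk: store it.
            index[fingerprint] = chunk
            new_bytes += len(chunk)
        # The recipe records every chunk, duplicate or not.
        recipe.append(fingerprint)
    return recipe, new_bytes


def restore(recipe, index) -> bytes:
    """Rebuild the original bytes from a recipe and the chunk index."""
    return b"".join(index[fingerprint] for fingerprint in recipe)
```

In a cluster setting, each map task could run a loop like this over its own input split, with the recipe playing the role of the per-file metadata; that mapping is an illustration only.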
Pages: 98-105
Page count: 8
Related papers
50 in total
  • [1] Data Deduplication based on Hadoop
    Zhang, Dongzhan
    Liao, Chengfa
    Yan, Wenjing
    Tao, Ran
    Zheng, Wei
    [J]. 2017 FIFTH INTERNATIONAL CONFERENCE ON ADVANCED CLOUD AND BIG DATA (CBD), 2017, : 147 - 152
  • [2] Big Data Analysis Using Hadoop Cluster
    Saldhi, Ankita
    Goel, Abhinav
    Yadav, Dipesh
    Saldhi, Ankur
    Saksena, Dhruv
    Indu, S.
    [J]. 2014 IEEE INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND COMPUTING RESEARCH (IEEE ICCIC), 2014, : 572 - 575
  • [3] SecDedoop: Secure Deduplication with Access Control of Big Data in the HDFS/Hadoop Environment
    Ramya, P.
    Sundar, C.
    [J]. BIG DATA, 2020, 8 (02) : 147 - 163
  • [4] BIG-BIO: - Big Data Hadoop-based Analytic Cluster Framework for Bioinformatics
    Abul Seoud, Rania Ahmed Abdel Azeem
    Mahmoud, Mahmoud Ahmed
    Ramadan, Amr Essam Eldin
    [J]. 2017 INTERNATIONAL CONFERENCE ON INFORMATICS, HEALTH & TECHNOLOGY (ICIHT), 2017,
  • [5] Scalable Similarity Joins for Fast and Accurate Record Deduplication in Big Data
    Rozinek, Ondrej
    Borkovcova, Monika
    Mares, Jan
    [J]. GOOD PRACTICES AND NEW PERSPECTIVES IN INFORMATION SYSTEMS AND TECHNOLOGIES, VOL 6, WORLDCIST 2024, 2024, 990 : 181 - 191
  • [6] A cluster-based data deduplication technology
    Tseng, Chuan-Mu
    Ciou, Jheng-Rong
    Liu, Tzong-Jye
    [J]. 2014 SECOND INTERNATIONAL SYMPOSIUM ON COMPUTING AND NETWORKING (CANDAR), 2014, : 226 - 230
  • [7] Mining the Associated Patterns in Big Data Using Hadoop Cluster
    Asha, P.
    Jacob, T. Prem
    Pravin, A.
    Asbern, A.
    [J]. INTERNATIONAL CONFERENCE ON INTELLIGENT DATA COMMUNICATION TECHNOLOGIES AND INTERNET OF THINGS, ICICI 2018, 2019, 26 : 1255 - 1263
  • [8] A Bloom Filter-Based Data Deduplication for Big Data
    Podder, Shrayasi
    Mukherjee, S.
    [J]. ADVANCES IN DATA AND INFORMATION SCIENCES, VOL 1, 2018, 38 : 161 - 168
  • [9] Scalable Hadoop-Based Pooled Time Series of Big Video Data from the Deep Web
    Mattmann, Chris A.
    Sharan, Madhav
    [J]. PROCEEDINGS OF THE 2017 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR'17), 2017, : 117 - 120
  • [10] Performance Modeling and Analysis of a Hadoop Cluster for Efficient Big Data Processing
    Lim, JongBeom
    Ahnh, Jong-Suk
    Lee, Kang-Woo
    [J]. ADVANCED SCIENCE LETTERS, 2016, 22 (09) : 2314 - 2319