Hadoop Based Scalable Cluster Deduplication for Big Data

Cited by: 4
Authors
Liu, Qing [1]
Fu, Yinjin [1]
Ni, Guiqiang [1]
Hou, Rui [2]
Affiliations
[1] PLA Univ Sci & Technol, Coll Command Informat Syst, Nanjing, Jiangsu, Peoples R China
[2] Inst Elect Syst Engn, Beijing, Peoples R China
Keywords
data deduplication; big data; Hadoop; HBase; index management
DOI
10.1109/ICDCSW.2016.17
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The exponential growth of data poses a tremendous challenge to storage systems in data centers. Data deduplication, which detects and eliminates redundant data in a dataset, can greatly reduce the volume of data and improve the utilization of storage space. This paper presents Halodedu, a scalable and reliable cluster deduplication system built on a Hadoop-based cloud computing platform. Halodedu uses MapReduce for parallel deduplication processing and HDFS for data storage management. A local database on each node provides fast, distributed management of the chunk fingerprint index. To maintain the availability and reliability of metadata, HBase stores the metadata of backup files. Halodedu is evaluated with virtual machine images as the input dataset. Comparative experiments demonstrate that Halodedu improves deduplication speed and system scalability.
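The core idea of a chunk fingerprint index can be illustrated with a minimal sketch. This is not Halodedu's implementation: fixed-size chunking, SHA-1 fingerprints, and a plain in-memory dictionary standing in for the per-node local database are all assumptions made for illustration, and the names `deduplicate` and `restore` are hypothetical.

```python
import hashlib


def deduplicate(data: bytes, index: dict, chunk_size: int = 4096):
    """Split data into fixed-size chunks and store only unseen chunks.

    `index` maps a chunk fingerprint to the chunk bytes. It is a stand-in
    (assumption) for a per-node fingerprint index; a real system would keep
    the chunk data in a separate store such as HDFS.
    Returns the file recipe (ordered fingerprints) and the bytes newly stored.
    """
    recipe = []
    new_bytes = 0
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha1(chunk).hexdigest()
        if fingerprint not in index:
            # First occurrence of this chunk: store it.
            index[fingerprint] = chunk
            new_bytes += len(chunk)
        # Duplicate or not, the recipe records the fingerprint in order.
        recipe.append(fingerprint)
    return recipe, new_bytes


def restore(recipe, index):
    """Rebuild the original data from its recipe and the chunk index."""
    return b"".join(index[fingerprint] for fingerprint in recipe)


if __name__ == "__main__":
    index = {}
    data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096
    recipe, stored = deduplicate(data, index)
    print(len(data), "bytes in,", stored, "bytes stored")
    assert restore(recipe, index) == data
```

In this sketch the recipe plays the role of the backup-file metadata (which the abstract says is kept in HBase), while the dictionary plays the role of the fingerprint index; how the actual system partitions the index across nodes is not described in the abstract.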
Pages: 98-105
Page count: 8
Related papers
50 in total
  • [21] Deduplication on Encrypted Big Data in Cloud
    Yan, Zheng
    Ding, Wenxiu
    Yu, Xixun
    Zhu, Haiqi
    Deng, Robert H.
    [J]. IEEE Transactions on Big Data, 2016, 2 (02) : 138 - 150
  • [22] The Research on Big Data Security Architecture Based on Hadoop
    Zhuang, Miao
    [J]. PROCEEDINGS OF THE 2015 4TH NATIONAL CONFERENCE ON ELECTRICAL, ELECTRONICS AND COMPUTER ENGINEERING (NCEECE 2015), 2016, 47 : 241 - 244
  • [23] Power Big Data platform Based on Hadoop Technology
    Chen, Jilin
    Liu, Nana
    Chen, Yong
    Qiu, Weijiang
    [J]. PROCEEDINGS OF THE 2016 6TH INTERNATIONAL CONFERENCE ON MACHINERY, MATERIALS, ENVIRONMENT, BIOTECHNOLOGY AND COMPUTER (MMEBC), 2016, 88 : 571 - 576
  • [24] Hadoop based Demography Big Data Management System
    Bukhari, Syeda Sana
    Park, JinHyuck
    Shin, Dong Ryeol
    [J]. 2018 19TH IEEE/ACIS INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ARTIFICIAL INTELLIGENCE, NETWORKING AND PARALLEL/DISTRIBUTED COMPUTING (SNPD), 2018, : 93 - 98
  • [25] Performance Evaluation Of Association Mining In Hadoop Single Node Cluster With Big Data
    Asbern, A.
    Asha, P.
    [J]. 2015 INTERNATIONAL CONFERENCE ON CIRCUITS, POWER AND COMPUTING TECHNOLOGIES (ICCPCT-2015), 2015,
  • [26] Elastic Data Routing in Cluster-based Deduplication Systems
    Wang, Yufeng
    Tang, Shaojie
    Tan, Chiu C.
    [J]. 2014 IEEE CONFERENCE ON COMPUTER COMMUNICATIONS WORKSHOPS (INFOCOM WKSHPS), 2014, : 117 - 118
  • [27] Differential Evolution based bucket indexed data deduplication for big data storage
    Kumar, Naresh
    Antwal, Shobha
    Jain, S. C.
    [J]. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2018, 34 (01) : 491 - 505
  • [28] SDVC: A Scalable Deduplication Cluster for Virtual Machine Images in Cloud
    Lin, Chuan
    Cao, Qiang
    Zhang, Hongliang
    Huang, Guoqiang
    Xie, Changsheng
    [J]. 2014 9TH IEEE INTERNATIONAL CONFERENCE ON NETWORKING, ARCHITECTURE, AND STORAGE (NAS), 2014, : 88 - 92
  • [29] Hadoop Based Parallel Deduplication Method for Web Documents
    Song, Junjie
    Liu, Jin
    Zheng, Yuhui
    [J]. ADVANCES IN COMPUTER SCIENCE AND UBIQUITOUS COMPUTING, 2018, 474 : 499 - 504
  • [30] Design of an Exact Data Deduplication Cluster
    Kaiser, Juergen
    Meister, Dirk
    Brinkmann, Andre
    Effert, Sascha
    [J]. 2012 IEEE 28TH SYMPOSIUM ON MASS STORAGE SYSTEMS AND TECHNOLOGIES (MSST), 2012,