Hadoop Based Scalable Cluster Deduplication for Big Data

Cited by: 4

Authors:

Liu, Qing [1]

Fu, Yinjin [1]

Ni, Guiqiang [1]

Hou, Rui [2]

Affiliations:

[1] PLA Univ Sci & Technol, Coll Command Informat Syst, Nanjing, Jiangsu, Peoples R China

[2] Inst Elect Syst Engn, Beijing, Peoples R China

Source:

2016 IEEE 36TH INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS WORKSHOPS (ICDCSW 2016) | 2016

Keywords:

data deduplication; big data; Hadoop; HBase; index management;

DOI:

10.1109/ICDCSW.2016.17

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The exponential growth of data poses a tremendous challenge to storage systems in data centers. Data deduplication, which detects and eliminates redundant data in a dataset, can greatly reduce the volume of stored data and improve the utilization of storage space. This paper presents Halodedu, a scalable and reliable cluster deduplication system built on a Hadoop-based cloud computing platform. Halodedu uses MapReduce to perform deduplication in parallel and HDFS to manage data storage. An intra-node local database provides fast, distributed management of the chunk fingerprint index. To maintain the availability and reliability of metadata, HBase stores the metadata of backup files. Halodedu is evaluated with virtual machine images as the input dataset. Comparative experiments show that Halodedu improves deduplication speed and system scalability.
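
The chunk fingerprint index mentioned in the abstract can be illustrated with a minimal single-process sketch. This is not Halodedu's implementation: the fixed 4 KiB chunk size, the SHA-1 fingerprint, and the names `ChunkStore`, `backup`, and `restore` are all assumptions made for illustration. A plain dictionary stands in for the intra-node local database, and a second dictionary stands in for the file metadata that the paper keeps in HBase.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size; the abstract does not specify one


def fingerprint(chunk: bytes) -> str:
    """Return a SHA-1 hex digest, used here as the chunk fingerprint (an assumption)."""
    return hashlib.sha1(chunk).hexdigest()


class ChunkStore:
    """Hypothetical in-memory stand-in for the fingerprint index and file metadata."""

    def __init__(self):
        self.index = {}    # fingerprint -> chunk bytes (stand-in for the local database)
        self.recipes = {}  # file name -> ordered fingerprints (stand-in for HBase metadata)

    def backup(self, name: str, data: bytes) -> int:
        """Store a file, keeping only unseen chunks; return how many chunks were new."""
        new_chunks = 0
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            fp = fingerprint(chunk)
            if fp not in self.index:  # duplicate chunks are detected by fingerprint lookup
                self.index[fp] = chunk
                new_chunks += 1
            recipe.append(fp)
        self.recipes[name] = recipe
        return new_chunks

    def restore(self, name: str) -> bytes:
        """Rebuild a file by concatenating its chunks in recipe order."""
        return b"".join(self.index[fp] for fp in self.recipes[name])
```

In a cluster setting, each node would hold its own index and the chunking and lookup work would be spread across parallel tasks; the sketch only shows the per-chunk logic that such tasks would share.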

Citation

Pages: 98-105

Page count: 8
