Online Data Deduplication for In-Memory Big-Data Analytic Systems

Cited by: 0
Authors
Sun, Yushi [1]
Zeng, Catherine Y.
Chung, Jaeyoon [2]
Huang, Zhe [2]
Affiliations
[1] Hong Kong Univ Sci & Technol, Sai Kung, Hong Kong, Peoples R China
[2] Princeton Univ, Princeton, NJ 08544 USA
Keywords
CLOUD;
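DOI
Not available
Chinese Library Classification number
TN [Electronic technology, communication technology];
Discipline classification code
0809;
Abstract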
Given a set of files that show a certain degree of similarity, we consider the novel problem of eliminating data redundancy across a set of distributed worker nodes in a shared-nothing, in-memory big-data analytic system. The redundancy elimination scheme is designed to be (i) space-efficient: the total space needed to store the files is minimized, and (ii) access-isolated: data shuffling among servers is also minimized. In this paper, we first show that finding an access-efficient and space-optimal solution is an NP-hard problem. We then present file partitioning algorithms that locate access-efficient solutions incrementally, with polynomial time complexity. Our experimental verification on multiple data sets confirms that the proposed file partitioning solution achieves a compression ratio close to the optimal compression performance achieved by a centralized solution.
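To make the trade-off in the abstract concrete, the following is a minimal sketch of one possible greedy, incremental placement heuristic. It is not the authors' algorithm, which this record does not describe. The fixed-size chunking, the `CHUNK_SIZE` value, and the names `chunk_hashes`, `place_files`, and `stored_chunks` are all assumptions made for illustration. Each file is kept whole on a single worker (so reading it needs no shuffling), and each worker deduplicates only the chunks it holds.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical fixed chunk size in bytes, tiny for illustration


def chunk_hashes(data: bytes, size: int = CHUNK_SIZE) -> set[str]:
    """Split data into fixed-size chunks and return the set of chunk digests."""
    return {
        hashlib.sha256(data[i:i + size]).hexdigest()
        for i in range(0, len(data), size)
    }


def place_files(files: dict[str, bytes], num_workers: int):
    """Assign each file, in arrival order, to exactly one worker.

    Greedy rule (an assumption, not the paper's method): choose the worker
    where the file adds the fewest new chunks; break ties by the smaller
    store, then by the lower worker index.
    """
    stores: list[set[str]] = [set() for _ in range(num_workers)]
    placement: dict[str, int] = {}
    for name, data in files.items():
        chunks = chunk_hashes(data)
        best = min(
            range(num_workers),
            key=lambda w: (len(chunks - stores[w]), len(stores[w]), w),
        )
        stores[best] |= chunks  # worker-local deduplication
        placement[name] = best
    return placement, stores


def stored_chunks(stores: list[set[str]]) -> int:
    """Total number of unique chunks kept across all workers."""
    return sum(len(s) for s in stores)


if __name__ == "__main__":
    files = {
        "a": b"AAAABBBBCCCC",
        "b": b"AAAABBBBDDDD",
        "c": b"XXXXYYYYZZZZ",
        "d": b"XXXXYYYYWWWW",
    }
    placement, stores = place_files(files, num_workers=2)
    centralized = len(set().union(*(chunk_hashes(d) for d in files.values())))
    print(placement, stored_chunks(stores), centralized)
```

In this toy input, similar files end up on the same worker, so the distributed stores hold as many unique chunks as a single centralized store would. The sketch ignores load balancing and memory limits, which a real system would have to account for.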
Pages: 7