Online Data Deduplication for In-Memory Big-Data Analytic Systems

Cited by: 0
Authors
Sun, Yushi [1 ]
Zeng, Catherine Y.
Chung, Jaeyoon [2 ]
Huang, Zhe [2 ]
Affiliations
[1] Hong Kong Univ Sci & Technol, Sai Kung, Hong Kong, Peoples R China
[2] Princeton Univ, Princeton, NJ 08544 USA
Keywords
CLOUD;
DOI
Not available
Chinese Library Classification number
TN [Electronic technology, communication technology];
Discipline classification code
0809;
Abstract
Given a set of files that show a certain degree of similarity, we consider the novel problem of eliminating data redundancy across a set of distributed worker nodes in a shared-nothing in-memory big-data analytic system. The redundancy elimination scheme is designed to be (i) space-efficient: the total space needed to store the files is minimized; and (ii) access-isolated: data shuffling among servers is also minimized. In this paper, we first show that finding an access-efficient and space-optimal solution is an NP-hard problem. We then present file partitioning algorithms that locate access-efficient solutions incrementally with low (polynomial) time complexity. Our experimental verification on multiple data sets confirms that the proposed file partitioning solution achieves a compression ratio close to the optimal compression performance of a centralized solution.
Pages: 7