Online Data Deduplication for In-Memory Big-Data Analytic Systems

Cited by: 0
Authors
Sun, Yushi [1]
Zeng, Catherine Y.
Chung, Jaeyoon [2]
Huang, Zhe [2]
Affiliations
[1] Hong Kong Univ Sci & Technol, Sai Kung, Hong Kong, Peoples R China
[2] Princeton Univ, Princeton, NJ 08544 USA
Keywords
CLOUD;
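DOI
Not available
Chinese Library Classification number
TN [Electronic technology, communication technology];
Discipline classification code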
0809 ;
Abstract
Given a set of files that exhibit a certain degree of similarity, we consider the novel problem of eliminating data redundancy across a set of distributed worker nodes in a shared-nothing, in-memory big-data analytic system. The redundancy elimination scheme is designed to be (i) space-efficient: the total space needed to store the files is minimized; and (ii) access-isolated: data shuffling among servers is also minimized. In this paper, we first show that finding a solution that is both access-efficient and space-optimal is NP-hard. We then present file partitioning algorithms that locate access-efficient solutions incrementally in polynomial time. Experiments on multiple data sets confirm that the proposed file partitioning solution achieves a compression ratio close to the optimal compression achieved by a centralized solution.
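The trade-off described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm, which the abstract does not specify; it is a simple greedy heuristic written under stated assumptions: files are split into fixed-size chunks, chunks are identified by SHA-1 fingerprints, and each worker holds a bounded number of files. The names `chunk_fingerprints`, `place_files`, `CHUNK_SIZE`, and `max_files_per_worker` are all hypothetical. Each file is stored whole on a single worker, so reading it needs no cross-worker shuffling, and each arriving file is placed on the worker that already holds most of its chunks.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical chunk size in bytes; tiny, for illustration only


def chunk_fingerprints(data, chunk_size=CHUNK_SIZE):
    """Split data into fixed-size chunks and return the set of SHA-1 fingerprints."""
    return {
        hashlib.sha1(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    }


def place_files(files, num_workers, max_files_per_worker):
    """Greedy incremental placement (an illustrative heuristic, not the paper's method).

    Each file goes to the worker with spare capacity that already stores the
    most of its chunks; ties go to the worker holding fewer chunks.
    """
    stores = [set() for _ in range(num_workers)]  # distinct chunks kept per worker
    counts = [0] * num_workers                    # files assigned per worker
    placement = {}
    for name, data in files.items():
        chunks = chunk_fingerprints(data)
        candidates = [w for w in range(num_workers) if counts[w] < max_files_per_worker]
        if not candidates:
            raise ValueError("no worker has spare capacity")
        best = max(candidates, key=lambda w: (len(chunks & stores[w]), -len(stores[w])))
        stores[best] |= chunks  # only chunks not already on this worker add space
        counts[best] += 1
        placement[name] = best
    return placement, stores


# Toy input: files "a"/"b" share two chunks, as do "c"/"d".
files = {
    "a": b"AAAABBBBCCCC",
    "b": b"AAAABBBBDDDD",
    "c": b"XXXXYYYYZZZZ",
    "d": b"XXXXYYYYWWWW",
}
placement, stores = place_files(files, num_workers=2, max_files_per_worker=2)

distributed = sum(len(s) for s in stores)  # chunks stored across all workers
centralized = len(set().union(*(chunk_fingerprints(d) for d in files.values())))
print(placement, distributed, centralized)
```

In this toy input, similar files end up on the same worker, so the distributed chunk count matches what a single centralized deduplicated store would hold. With less favorable inputs or tighter capacity, the two counts diverge, which is the gap the abstract's compression-ratio comparison measures.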
Pages: 7