A distributed data management system to support large-scale data analysis

Cited by: 15
Authors
Emara, Tamer Z. [1]
Huang, Joshua Zhexue [1,2]
Affiliations
[1] Shenzhen Univ, Coll Comp Sci & Software Engn, Big Data Inst, Shenzhen 518060, Guangdong, Peoples R China
[2] Shenzhen Univ, Natl Engn Lab Big Data Syst Comp Technol, Shenzhen 518060, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Big data; Distributed and parallel processing; Random sample partition; Randomness; Data management; MapReduce
DOI
10.1016/j.jss.2018.11.007
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification codes
081202; 0835
Abstract
Distributed data management is a key technology for efficient processing and analysis of massive data in cluster-computing environments. In particular, when data volumes exceed the system's capacity, big data files must be summarized by representative samples that have the same statistical properties as the whole dataset. This paper proposes a big data management system (BDMS) based on distributed random sample data blocks. It presents a high-level architecture design of the BDMS, which extends current distributed file systems. The system offers block-level management functions such as statistically aware data partitioning, data block organization, and data block selection. The paper also presents a round-random partitioning scheme that represents a big dataset as a set of non-overlapping data blocks, each of which is a random sample of the whole dataset. Based on this scheme, two algorithms are introduced as an implementation strategy for converting the HDFS blocks of a big file into a set of random sample data blocks, which are also stored in HDFS. The experimental results show that the execution time of the partitioning operation is acceptable in real applications because the operation is performed only once on each input data file. (C) 2018 Elsevier Inc. All rights reserved.
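The abstract does not spell out the two algorithms, so the following is only a minimal in-memory sketch of the general idea behind a round-random partition, under stated assumptions: plain Python lists stand in for HDFS blocks, each original block is shuffled locally, and its records are dealt round-robin across the output blocks. The function name `round_random_partition` and its parameters are hypothetical and are not taken from the paper.

```python
import random


def round_random_partition(records, num_original_blocks, num_rsp_blocks, seed=0):
    """Hypothetical sketch, not the paper's algorithm.

    Splits `records` into non-overlapping output blocks, each of which
    draws records from every original block.
    """
    rng = random.Random(seed)

    # Step 1: cut the input sequentially into "original" blocks.
    # These lists are an assumed stand-in for the HDFS blocks of a big file.
    size = -(-len(records) // num_original_blocks)  # ceiling division
    original_blocks = [records[i:i + size] for i in range(0, len(records), size)]

    rsp_blocks = [[] for _ in range(num_rsp_blocks)]
    for block in original_blocks:
        # Step 2: shuffle a copy of each original block locally.
        shuffled = block[:]
        rng.shuffle(shuffled)
        # Step 3: deal the shuffled records round-robin, so every output
        # block receives a random slice of every original block.
        for j, record in enumerate(shuffled):
            rsp_blocks[j % num_rsp_blocks].append(record)
    return rsp_blocks
```

For example, `round_random_partition(list(range(1000)), 10, 5)` yields five blocks of 200 records each that together cover the input exactly once. Whether each block behaves as a random sample of the whole dataset depends on the local shuffles and on every output block drawing from all original blocks; a real implementation would run these steps as distributed tasks rather than in a single process.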
Pages: 105-115
Number of pages: 11