RRPlib: A spark library for representing HDFS blocks as a set of random sample data blocks

Cited by: 6
Authors
Emara, Tamer Z. [1 ,2 ,3 ]
Huang, Joshua Zhexue [1 ,2 ]
Affiliations
[1] Shenzhen Univ, Coll Comp Sci & Software Engn, Big Data Inst, Shenzhen 518060, Guangdong, Peoples R China
[2] Shenzhen Univ, Natl Engn Lab Big Data Syst Comp Technol, Shenzhen 518060, Guangdong, Peoples R China
[3] Higher Inst Engn & Technol Kafrelsheikh, Kafrelsheikh, Egypt
Funding
National Natural Science Foundation of China;
Keywords
HDFS; Random sample; Data partitioning; Distributed systems;
DOI
10.1016/j.scico.2019.102301
Chinese Library Classification (CLC) number
TP31 [Computer software];
Discipline classification code
081202 ; 0835 ;
Abstract
Analyzing big data is a challenging problem in cluster computing systems, especially when the data volume exceeds the available computing resources. Sampling is the favored solution for such problems: it summarizes or reduces the amount of data while taking into account the statistical characteristics of the data distribution. However, the traditional method of sampling massive data by drawing record by record is computationally expensive because it requires a full scan of the whole data set. If instead the massive data is partitioned into a set of data blocks, each of which is a random sample of the data, then selecting some blocks as a sample (or as several different samples) is computationally cheaper. The main purpose of the software described in this paper is to represent HDFS blocks as a set of random sample data blocks that are also stored in HDFS. Our empirical results show that the performance of the partitioning operation is acceptable in real applications, especially since this operation is performed only once, which makes analysis of terabyte-scale data more practical. (C) 2019 Elsevier B.V. All rights reserved.
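The block-level sampling idea in the abstract can be illustrated with a minimal sketch in plain Python. This is not the RRPlib API and does not use Spark or HDFS; the function name `to_random_sample_blocks`, the in-memory list standing in for a large file, and the shuffle-then-deal partitioning scheme are all assumptions made for illustration only.

```python
import random
from statistics import mean


def to_random_sample_blocks(records, num_blocks, seed=0):
    """Hypothetical helper: shuffle once, then deal records into blocks.

    After one full pass (the shuffle), every block is a random sample
    of the whole data set.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Slice with a stride so block i gets items i, i+num_blocks, ...
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


# Hypothetical data: sorted values, so contiguous chunks would be biased.
data = list(range(100_000))
blocks = to_random_sample_blocks(data, num_blocks=20)

# Drawing a sample now means picking whole blocks; no further scan
# of the full data set is needed.
picker = random.Random(1)
sample = [x for block in picker.sample(blocks, 2) for x in block]

print(len(sample), round(mean(sample)), round(mean(data)))
```

The one-time cost is the shuffle over all records; afterwards any number of samples can be drawn by choosing blocks, which is the trade-off the abstract describes.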
Pages: 7