Exploring and cleaning big data with random sample data blocks

Cited by: 15
Authors
Salloum, Salman [1, 2]
Huang, Joshua Zhexue [1, 2]
He, Yulin [1, 2]
Affiliations
[1] Shenzhen Univ, Coll Comp Sci & Software Engn, Big Data Inst, Shenzhen 518060, Peoples R China
[2] Shenzhen Univ, Natl Engn Lab Big Data Syst Comp Technol, Shenzhen 518060, Peoples R China
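Funding
National Key Research and Development Program of China; China Postdoctoral Science Foundation; National Natural Science Foundation of China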
Keywords
Big data; Exploratory data analysis; Statistical estimation; Data cleaning; Block-level sampling; Random sample partition; Distributed, parallel and cluster computing
DOI
10.1186/s40537-019-0205-4
Chinese Library Classification (CLC) number
TP301 [Theory and methods]
Discipline classification code
081202
Abstract
Data scientists need scalable methods to explore and clean big data before applying advanced data analysis and mining algorithms. In this paper, we propose the RSP-Explore method to enable data scientists to iteratively explore big data on small computing clusters. We address three main tasks: statistical estimation, error detection, and data cleaning. The Random Sample Partition (RSP) distributed data model is used to represent the data as a set of ready-to-use random sample data blocks (called RSP blocks) of the entire data. Block-level samples of RSP blocks are selected to understand the data, identify potential types of value errors, and get samples of clean data. We provide a theoretical analysis of using RSP blocks for statistical estimation and demonstrate empirically the advantages of the RSP-Explore method. Experimental results on three real data sets show that the approximate results from RSP-Explore can rapidly converge toward the true values. Furthermore, cleaning a sample of RSP blocks is sufficient to estimate the statistical properties of the unknown clean data.
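The block-level sampling idea in the abstract can be illustrated with a small sketch. It is not the authors' distributed implementation: it assumes the data fits in memory as a Python list, builds the blocks with a plain shuffle-and-split, and uses hypothetical helper names (`make_rsp_blocks`, `estimate_mean`) and synthetic Gaussian data chosen only for illustration.

```python
import random
import statistics


def make_rsp_blocks(records, num_blocks, seed=0):
    """Shuffle the records and cut them into equal-size blocks.

    Assumption for this sketch: after a uniform shuffle, every block is a
    random sample of the whole data set, which is the property the RSP
    model relies on. Records left over after the last full block are dropped.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    block_size = len(shuffled) // num_blocks
    return [
        shuffled[i * block_size:(i + 1) * block_size]
        for i in range(num_blocks)
    ]


def estimate_mean(blocks, num_selected, seed=0):
    """Estimate the overall mean from a block-level sample.

    Selects `num_selected` blocks without replacement and averages their
    block means (valid here because all blocks have the same size).
    """
    rng = random.Random(seed)
    selected = rng.sample(blocks, num_selected)
    block_means = [statistics.fmean(block) for block in selected]
    return statistics.fmean(block_means)


# Synthetic stand-in for a big data set: 100,000 values around 50.
data_rng = random.Random(42)
data = [data_rng.gauss(50.0, 10.0) for _ in range(100_000)]

blocks = make_rsp_blocks(data, num_blocks=100, seed=1)
true_mean = statistics.fmean(data)
estimate = estimate_mean(blocks, num_selected=5, seed=2)

print(f"true mean = {true_mean:.3f}, estimate from 5 of 100 blocks = {estimate:.3f}")
```

Here only 5 of the 100 blocks are read, yet the estimate should land close to the full-data mean; increasing `num_selected` tightens it further, which mirrors the convergence behaviour the abstract reports.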
Pages: 28
Related papers
  • [1] Exploring and cleaning big data with random sample data blocks
    Salman Salloum
    Joshua Zhexue Huang
    Yulin He
    Journal of Big Data, 6
  • [2] Random Sample Partition: A Distributed Data Model for Big Data Analysis
    Salloum, Salman
    Huang, Joshua Zhexue
    He, Yulin
    IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, 2019, 15 (11) : 5846 - 5854
  • [3] Big Data Cleaning
    Tang, Nan
    WEB TECHNOLOGIES AND APPLICATIONS, APWEB 2014, 2014, 8709 : 13 - 24
  • [4] Research on the Technology of Data Cleaning in Big Data
    Feng, Fu-jun
    Yao, Jun-ping
    Li, Xiao-jun
    2018 2ND INTERNATIONAL CONFERENCE ON APPLIED MATHEMATICS, MODELING AND SIMULATION (AMMS 2018), 2018, 305 : 176 - 181
  • [5] Big RDF Data Cleaning
    Tang, Nan
    2015 13TH IEEE INTERNATIONAL CONFERENCE ON DATA ENGINEERING WORKSHOPS (ICDEW), 2015 : 77 - 79
  • [6] RRPlib: A spark library for representing HDFS blocks as a set of random sample data blocks
    Emara, Tamer Z.
    Huang, Joshua Zhexue
    SCIENCE OF COMPUTER PROGRAMMING, 2019, 184
  • [7] Data Cleaning Mechanism for Big Data and Cloud Computing
    Rahul, Kumar
    Banyal, R. K.
    PROCEEDINGS OF THE 2019 6TH INTERNATIONAL CONFERENCE ON COMPUTING FOR SUSTAINABLE GLOBAL DEVELOPMENT (INDIACOM), 2019 : 195 - 198
  • [8] Data Cleaning Technique for Security Big Data Ecosystem
    Martinez-Mosquera, Diana
    Lujan-Mora, Sergio
    IOTBDS: PROCEEDINGS OF THE 2ND INTERNATIONAL CONFERENCE ON INTERNET OF THINGS, BIG DATA AND SECURITY, 2017 : 380 - 385
  • [9] Big Data Cleaning Based on Improved CLOF and Random Forest for Distribution Networks
    Liu, Jie
    Cao, Yijia
    Li, Yong
    Guo, Yixiu
    Deng, Wei
    CSEE JOURNAL OF POWER AND ENERGY SYSTEMS, 2024, 10 (06) : 2528 - 2538
  • [10] A Data Fusion and Data Cleaning System for Smart Grids Big Data
    Lv, Zhining
    Deng, Wei
    Zhang, Zhihan
    Guo, Ningxuan
    Yan, Gangfeng
    2019 IEEE INTL CONF ON PARALLEL & DISTRIBUTED PROCESSING WITH APPLICATIONS, BIG DATA & CLOUD COMPUTING, SUSTAINABLE COMPUTING & COMMUNICATIONS, SOCIAL COMPUTING & NETWORKING (ISPA/BDCLOUD/SOCIALCOM/SUSTAINCOM 2019), 2019 : 802 - 807