Exploring and cleaning big data with random sample data blocks

Cited by: 15
Authors
Salloum, Salman [1 ,2 ]
Huang, Joshua Zhexue [1 ,2 ]
He, Yulin [1 ,2 ]
Affiliations
[1] Shenzhen Univ, Coll Comp Sci & Software Engn, Big Data Inst, Shenzhen 518060, Peoples R China
[2] Shenzhen Univ, Natl Engn Lab Big Data Syst Comp Technol, Shenzhen 518060, Peoples R China
Funding
National Key R&D Program of China; China Postdoctoral Science Foundation; National Natural Science Foundation of China;
Keywords
Big data; Exploratory data analysis; Statistical estimation; Data cleaning; Block-level sampling; Random sample partition; Distributed, parallel and cluster computing;
DOI
10.1186/s40537-019-0205-4
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202 ;
Abstract
Data scientists need scalable methods to explore and clean big data before applying advanced data analysis and mining algorithms. In this paper, we propose the RSP-Explore method to enable data scientists to iteratively explore big data on small computing clusters. We address three main tasks: statistical estimation, error detection, and data cleaning. The Random Sample Partition (RSP) distributed data model is used to represent the data as a set of ready-to-use random sample data blocks (called RSP blocks) of the entire data. Block-level samples of RSP blocks are selected to understand the data, identify potential types of value errors, and get samples of clean data. We provide a theoretical analysis on using RSP blocks for statistical estimation and demonstrate empirically the advantages of the RSP-Explore method. The experimental results of three real data sets show that the approximate results from RSP-Explore can rapidly converge toward the true values. Furthermore, cleaning a sample of RSP blocks is sufficient to estimate the statistical properties of the unknown clean data.
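The block-level sampling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' RSP-Explore implementation: the function names `make_rsp_blocks` and `estimate_mean`, the synthetic Gaussian data, and the block counts are all assumptions chosen for illustration. The sketch shuffles the records, splits them into equal-sized blocks so that each block behaves like a random sample of the whole data set, and then estimates the mean from a handful of selected blocks.

```python
import random
import statistics


def make_rsp_blocks(records, num_blocks, seed=0):
    """Shuffle the records and deal them into num_blocks blocks.

    Hypothetical helper: after shuffling, each block is a random
    sample of the full data set (the property RSP blocks rely on).
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def estimate_mean(blocks, num_selected, seed=0):
    """Estimate the overall mean from a block-level sample.

    Selects num_selected blocks without replacement and averages
    their per-block means (blocks are assumed to be equal-sized).
    """
    rng = random.Random(seed)
    selected = rng.sample(blocks, num_selected)
    block_means = [statistics.fmean(block) for block in selected]
    return statistics.fmean(block_means)


# Synthetic data standing in for a big data set (assumption).
data_rng = random.Random(42)
data = [data_rng.gauss(50.0, 10.0) for _ in range(100_000)]

blocks = make_rsp_blocks(data, num_blocks=100)
true_mean = statistics.fmean(data)
approx_mean = estimate_mean(blocks, num_selected=5)

print(f"true mean:      {true_mean:.3f}")
print(f"estimated mean: {approx_mean:.3f} (from 5 of 100 blocks)")
```

Increasing `num_selected` uses more blocks and should, on average, move the estimate closer to the full-data mean, which is the convergence behaviour the abstract reports for its experiments.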
Pages: 28
Related papers
50 entries in total
  • [21] EXPLORING OUT-OF-SAMPLE PREDICTION AND SPATIAL DEPENDENCY FOR COMPLEX BIG DATA
    Bilchouris, Adam
    BULLETIN OF THE AUSTRALIAN MATHEMATICAL SOCIETY, 2025,
  • [22] Big Data Cleaning Algorithms in Cloud Computing
    Feng, Zhang
    Hui-Feng, Xue
    Dong-Sheng, Xu
    Yong-Heng, Zhang
    Fei, You
    INTERNATIONAL JOURNAL OF ONLINE ENGINEERING, 2013, 9 (03) : 77 - 81
  • [23] Cleanix: a Parallel Big Data Cleaning System
    Wang, Hongzhi
    Li, Mingda
    Bu, Yingyi
    Li, Jianzhong
    Gao, Hong
    Zhang, Jiacheng
    SIGMOD RECORD, 2015, 44 (04) : 35 - 40
  • [24] Density estimation-based method to determine sample size for random sample partition of big data
    Yulin He
    Jiaqi Chen
    Jiaxing Shen
    Philippe Fournier-Viger
    Joshua Zhexue Huang
    Frontiers of Computer Science, 2024, 18
  • [25] A Data Cleaning Method for Big Trace Data Using Movement Consistency
    Yang, Xue
    Tang, Luliang
    Zhang, Xia
    Li, Qingquan
    SENSORS, 2018, 18 (03):
  • [26] Density estimation-based method to determine sample size for random sample partition of big data
    He, Yulin
    Chen, Jiaqi
    Shen, Jiaxing
    Fournier-Viger, Philippe
    Huang, Joshua Zhexue
    FRONTIERS OF COMPUTER SCIENCE, 2024, 18 (05)
  • [27] A Big Data Cleaning Method for Drinking-Water Streaming Data
    Gai, Rong-Li
    Zhang, Hao
    Thanh, Dang Ngoc Hoang
    BRAZILIAN ARCHIVES OF BIOLOGY AND TECHNOLOGY, 2023, 66
  • [28] Data cleaning and restoring method for vehicle battery big data platform
    Li, Shuangqi
    He, Hongwen
    Zhao, Pengfei
    Cheng, Shuang
    APPLIED ENERGY, 2022, 320
  • [29] A SYSTEMATIC MAPPING REVIEW ON DATA CLEANING METHODS IN BIG DATA ENVIRONMENTS
    Iwata, Claudio Keiji
    Galegale, Napoleao Verardi
    Ito, Marcia
    de Azevedo, Marilia Macorin
    Feitosa, Marcelo Duduchi
    Arima, Carlos Hideo
    IADIS-INTERNATIONAL JOURNAL ON COMPUTER SCIENCE AND INFORMATION SYSTEMS, 2024, 19 (02): : 19 - 36
  • [30] Mining Big Data with Random Forests
    Lulli, Alessandro
    Oneto, Luca
    Anguita, Davide
    COGNITIVE COMPUTATION, 2019, 11 (02) : 294 - 316