Exploring and cleaning big data with random sample data blocks

Cited by: 15
Authors
Salloum, Salman [1, 2]
Huang, Joshua Zhexue [1, 2]
He, Yulin [1, 2]
Affiliations
[1] Shenzhen Univ, Coll Comp Sci & Software Engn, Big Data Inst, Shenzhen 518060, Peoples R China
[2] Shenzhen Univ, Natl Engn Lab Big Data Syst Comp Technol, Shenzhen 518060, Peoples R China
Funding
National Key R&D Program of China; China Postdoctoral Science Foundation; National Natural Science Foundation of China;
Keywords
Big data; Exploratory data analysis; Statistical estimation; Data cleaning; Block-level sampling; Random sample partition; Distributed; Parallel and cluster computing;
DOI
10.1186/s40537-019-0205-4
Chinese Library Classification (CLC) number
TP301 [Theory, methods];
Discipline classification code
081202;
Abstract
Data scientists need scalable methods to explore and clean big data before applying advanced data analysis and mining algorithms. In this paper, we propose the RSP-Explore method to enable data scientists to iteratively explore big data on small computing clusters. We address three main tasks: statistical estimation, error detection, and data cleaning. The Random Sample Partition (RSP) distributed data model is used to represent the data as a set of ready-to-use random sample data blocks (called RSP blocks) of the entire data. Block-level samples of RSP blocks are selected to understand the data, identify potential types of value errors, and get samples of clean data. We provide a theoretical analysis of using RSP blocks for statistical estimation and demonstrate empirically the advantages of the RSP-Explore method. Experimental results on three real data sets show that the approximate results from RSP-Explore can rapidly converge toward the true values. Furthermore, cleaning a sample of RSP blocks is sufficient to estimate the statistical properties of the unknown clean data.
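The block-level sampling idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' RSP-Explore implementation: the function names `make_rsp_blocks` and `estimate_mean`, the synthetic Gaussian data, and the block counts are all assumptions chosen for illustration. The sketch shuffles the data once, cuts it into equal-sized blocks so that each block is a random sample of the whole, and then estimates the mean from a few randomly selected blocks.

```python
import random
import statistics


def make_rsp_blocks(records, num_blocks, seed=0):
    """Shuffle the records once and deal them into num_blocks blocks.

    After the shuffle, every block is a simple random sample of the
    full data set (hypothetical helper, not the paper's API).
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def estimate_mean(blocks, num_selected, seed=0):
    """Estimate the overall mean from a block-level sample.

    Selects num_selected whole blocks at random and averages their
    block means; this is valid here because all blocks have equal size.
    """
    rng = random.Random(seed)
    selected = rng.sample(blocks, num_selected)
    return statistics.fmean(statistics.fmean(b) for b in selected)


# Synthetic stand-in for a large data set (assumption for illustration).
data_rng = random.Random(42)
data = [data_rng.gauss(50.0, 10.0) for _ in range(100_000)]

blocks = make_rsp_blocks(data, num_blocks=100)
true_mean = statistics.fmean(data)
approx_mean = estimate_mean(blocks, num_selected=5)

print(f"true mean:   {true_mean:.3f}")
print(f"approx mean: {approx_mean:.3f} (from 5 of {len(blocks)} blocks)")
```

Only 5% of the records are read to form the estimate. Increasing `num_selected` should tighten it toward the full-data value, which is the kind of convergence the abstract reports.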
Pages: 28
Related papers
50 in total
  • [41] The Big Hubble Meta-Data Spring Cleaning
    Haase, Jones
    Durand, Daniel
    Fraquelli, Dorothy
    McLean, Brian
    ASTRONOMICAL DATA ANALYSIS SOFTWARE AND SYSTEMS XXV, 2017, 512 : 149 - 152
  • [42] Cleaning Framework for Big Data - Object Identification and Linkage
    Liu, Hong
    Kumar, Ashwin T. K.
    Thomas, Johnson P.
    2015 IEEE INTERNATIONAL CONGRESS ON BIG DATA - BIGDATA CONGRESS 2015, 2015 : 215 - 221
  • [43] The optimization of the big data cleaning based on task merging
    Yang D.-H.
    Li N.-N.
    Wang H.-Z.
    Li J.-Z.
    Gao H.
    Wang, Hong-Zhi (wangzh@hit.edu.cn), 1600, Science Press (39) : 97 - 108
  • [44] Data Integration by Combining Big Data and Survey Sample Data for Finite Population Inference
    Kim, Jae-Kwang
    Tam, Siu-Ming
    INTERNATIONAL STATISTICAL REVIEW, 2021, 89 (02) : 382 - 401
  • [45] Adaptive Classification of Big Data Flight Sample
    Liu Fei
    Yin Zhiping
    Huang Qiqing
    Zhang Xiayang
    Liu Jiapeng
    2015 INTERNATIONAL CONFERENCE ON COMPUTER AND COMPUTATIONAL SCIENCES (ICCCS), 2015 : 136 - 141
  • [46] Progressive Ensemble Learning for in-Sample Data Cleaning
    Wang, Jung-Hua
    Lee, Shih-Kai
    Wang, Ting-Yuan
    Chen, Ming-Jer
    Hsu, Shu-Wei
    IEEE ACCESS, 2024, 12 : 140643 - 140659
  • [47] SAMPLE DATA AND TRAINING MODULES FOR CLEANING BIODIVERSITY INFORMATION
    Cobos, Marlon E.
    Jimenez, Laura
    Nunez-Penichet, Claudia
    Romero-Alvarez, Daniel
    Simoes, Marianna
    BIODIVERSITY INFORMATICS, 2018, 13 : 49 - 50
  • [48] Exploring Big Data with Helix: Finding Needles in a Big Haystack
    Ellis, Jason
    Fokoue, Achille
    Hassanzadeh, Oktie
    Kementsietsidis, Anastasios
    Srinivas, Kavitha
    Ward, Michael J.
    SIGMOD RECORD, 2014, 43 (04) : 43 - 54
  • [49] Big Data in organizations: Exploring the adoption of Big Data applications and their impact on organizations in China and the Netherlands
    Raab, Jorg
    Pang, Yuting
    Baaijens, Joan
    Zhou, Honggeng
    BIG DATA RESEARCH, 2024, 36
  • [50] Random forest algorithm in big data environment
    Liu, Yingchun
    Computer Modelling and New Technologies, 2014, 18 (12) : 147 - 151