Exploring and cleaning big data with random sample data blocks

Cited by: 15
Authors
Salloum, Salman [1,2]
Huang, Joshua Zhexue [1,2]
He, Yulin [1,2]
Affiliations
[1] Shenzhen Univ, Coll Comp Sci & Software Engn, Big Data Inst, Shenzhen 518060, Peoples R China
[2] Shenzhen Univ, Natl Engn Lab Big Data Syst Comp Technol, Shenzhen 518060, Peoples R China
Funding
National Key R&D Program of China; China Postdoctoral Science Foundation; National Natural Science Foundation of China
Keywords
Big data; Exploratory data analysis; Statistical estimation; Data cleaning; Block-level sampling; Random sample partition; Distributed, parallel and cluster computing
DOI
10.1186/s40537-019-0205-4
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Data scientists need scalable methods to explore and clean big data before applying advanced data analysis and mining algorithms. In this paper, we propose the RSP-Explore method to enable data scientists to iteratively explore big data on small computing clusters. We address three main tasks: statistical estimation, error detection, and data cleaning. The Random Sample Partition (RSP) distributed data model is used to represent the data as a set of ready-to-use random sample data blocks (called RSP blocks) of the entire data. Block-level samples of RSP blocks are selected to understand the data, identify potential types of value errors, and get samples of clean data. We provide a theoretical analysis on using RSP blocks for statistical estimation and demonstrate empirically the advantages of the RSP-Explore method. The experimental results of three real data sets show that the approximate results from RSP-Explore can rapidly converge toward the true values. Furthermore, cleaning a sample of RSP blocks is sufficient to estimate the statistical properties of the unknown clean data.
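The block-level sampling idea in the abstract can be illustrated with a small Python sketch. This is not the authors' RSP-Explore implementation: the function names `make_rsp_blocks` and `estimate_mean`, the synthetic Gaussian data, and the block counts are all assumptions chosen for illustration. The sketch shuffles the data once, cuts it into equal-sized blocks so that each block is a random sample of the whole, and then estimates the mean from a few randomly selected blocks.

```python
import random
import statistics


def make_rsp_blocks(records, num_blocks, seed=0):
    """Shuffle the records once, then deal them into num_blocks blocks.

    After a uniform shuffle, every block is a random sample of the
    whole data set (an assumed, simplified stand-in for RSP blocks).
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Striding deals the shuffled records round-robin into the blocks.
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def estimate_mean(blocks, num_selected, seed=0):
    """Estimate the mean of the full data from a block-level sample.

    Averaging block means matches the pooled mean only when the
    selected blocks have equal size, as they do in this sketch.
    """
    rng = random.Random(seed)
    selected = rng.sample(blocks, num_selected)
    block_means = [statistics.fmean(block) for block in selected]
    return statistics.fmean(block_means)


# Synthetic data (hypothetical): 100,000 values with mean 50, std 10.
data_rng = random.Random(42)
data = [data_rng.gauss(50.0, 10.0) for _ in range(100_000)]

blocks = make_rsp_blocks(data, num_blocks=100)
true_mean = statistics.fmean(data)
approx_mean = estimate_mean(blocks, num_selected=5)

print(f"true mean:   {true_mean:.3f}")
print(f"approx mean: {approx_mean:.3f} (from 5 of 100 blocks)")
```

Only 5% of the records are read to produce the estimate. Raising `num_selected` should tighten the estimate, which is the convergence behaviour the abstract reports for RSP-Explore.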
Pages: 28
Related papers
50 in total
  • [31] Mining Big Data with Random Forests
    Lulli, Alessandro
    Oneto, Luca
    Anguita, Davide
    Cognitive Computation, 2019, 11 : 294 - 316
  • [32] Exploring the Benefits and Challenges of Big Data
    Farooq, Usman
    NEW INDUSTRIALIZATION AND URBANIZATION DEVELOPMENT ANNUAL CONFERENCE: THE INTERNATIONAL FORUM ON NEW INDUSTRIALIZATION DEVELOPMENT IN BIG-DATA ERA, 2015 : 606 - 621
  • [33] Exploring Big Data Governance Frameworks
    Al-Badi, Ali
    Tarhini, Ali
    Khan, Asharul Islam
    9TH INTERNATIONAL CONFERENCE ON EMERGING UBIQUITOUS SYSTEMS AND PERVASIVE NETWORKS (EUSPN-2018) / 8TH INTERNATIONAL CONFERENCE ON CURRENT AND FUTURE TRENDS OF INFORMATION AND COMMUNICATION TECHNOLOGIES IN HEALTHCARE (ICTH-2018), 2018, 141 : 271 - 277
  • [34] EXPLORING THE DATA TURN OF PHILOSOPHY OF LANGUAGE IN THE ERA OF BIG DATA
    Xu, Shasha
    Yang, Qian
    TRANS-FORM-ACAO, 2024, 47 (04)
  • [35] An Incorrect Data Detection Method for Big Data Cleaning of Machinery Condition Monitoring
    Xu, Xuefang
    Lei, Yaguo
    Li, Zeda
    IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, 2020, 67 (03) : 2326 - 2336
  • [36] A data cleaning model for electric power big data based on Spark framework
    Qu, Zhao-Yang
    Wang, Yong-Wen
    Wang, Chong
    Qu, Nan
    Yan, Jia
    International Journal of Database Theory and Application, 2016, 9 (03) : 137 - 150
  • [37] Data Cleaning Optimization for Grain Big Data Processing using Task Merging
    Ju, Xingang
    Lian, Feiyu
    Zhang, Yuan
    2019 6TH INTERNATIONAL CONFERENCE ON INFORMATION SCIENCE AND CONTROL ENGINEERING (ICISCE 2019), 2019 : 225 - 233
  • [38] What makes Big Data, Big Data? Exploring the ontological characteristics of 26 datasets
    Kitchin, Rob
    McArdle, Gavin
    BIG DATA & SOCIETY, 2016, 3 (01) : 1 - 10
  • [39] Cleaning Big Data Streams: A Systematic Literature Review
    Alotaibi, Obaid
    Pardede, Eric
    Tomy, Sarath
    Bagui, Sikha
    Iacono, Mauro
    TECHNOLOGIES, 2023, 11 (04)
  • [40] A Flexible Ensemble Algorithm for Big Data Cleaning of PMUs
    Shen, Long
    He, Xin
    Liu, Mingqun
    Qin, Risheng
    Guo, Cheng
    Meng, Xian
    Duan, Ruimin
    FRONTIERS IN ENERGY RESEARCH, 2021, 9