Exploring and cleaning big data with random sample data blocks

Cited by: 15

Authors:

Salloum, Salman [1,2]

Huang, Joshua Zhexue [1,2]

He, Yulin [1,2]

Affiliations:

[1] Shenzhen Univ, Coll Comp Sci & Software Engn, Big Data Inst, Shenzhen 518060, Peoples R China

[2] Shenzhen Univ, Natl Engn Lab Big Data Syst Comp Technol, Shenzhen 518060, Peoples R China

Source:

JOURNAL OF BIG DATA | 2019 / Volume 6 / Issue 01

Funding:

National Key R&D Program of China; China Postdoctoral Science Foundation; National Natural Science Foundation of China;

Keywords:

Big data; Exploratory data analysis; Statistical estimation; Data cleaning; Block-level sampling; Random sample partition; Distributed, parallel and cluster computing;

DOI:

10.1186/s40537-019-0205-4

Chinese Library Classification:

TP301 [Theory and methods];

Discipline classification code:

081202;

Abstract:

Data scientists need scalable methods to explore and clean big data before applying advanced data analysis and mining algorithms. In this paper, we propose the RSP-Explore method to enable data scientists to iteratively explore big data on small computing clusters. We address three main tasks: statistical estimation, error detection, and data cleaning. The Random Sample Partition (RSP) distributed data model is used to represent the data as a set of ready-to-use random sample data blocks (called RSP blocks) of the entire data. Block-level samples of RSP blocks are selected to understand the data, identify potential types of value errors, and get samples of clean data. We provide a theoretical analysis on using RSP blocks for statistical estimation and demonstrate empirically the advantages of the RSP-Explore method. The experimental results of three real data sets show that the approximate results from RSP-Explore can rapidly converge toward the true values. Furthermore, cleaning a sample of RSP blocks is sufficient to estimate the statistical properties of the unknown clean data.
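The block-level sampling idea in the abstract can be illustrated with a small in-memory sketch. This is not the authors' RSP-Explore implementation: the function names `make_rsp_blocks` and `estimate_mean`, the synthetic Gaussian data, and the shuffle-then-split partitioning are all assumptions made for illustration, and the paper's distributed procedure on a computing cluster may differ. The sketch only shows why whole blocks, once each block is a random sample of the full data, can stand in for record-level samples when estimating a statistic such as the mean.

```python
import random
import statistics


def make_rsp_blocks(data, num_blocks, seed=0):
    """Shuffle the records, then cut them into equal-sized blocks.

    After shuffling, every block is a random sample of the whole data set.
    This is a simplified, single-machine stand-in for a random sample
    partition (hypothetical helper, not the paper's algorithm).
    """
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    size = len(shuffled) // num_blocks
    return [shuffled[i * size:(i + 1) * size] for i in range(num_blocks)]


def estimate_mean(blocks, num_selected, seed=0):
    """Block-level sampling: pick whole blocks and average their means."""
    rng = random.Random(seed)
    selected = rng.sample(blocks, num_selected)
    block_means = [statistics.fmean(block) for block in selected]
    return statistics.fmean(block_means)


# Synthetic data (an assumption): 100,000 values drawn from N(50, 10).
data_rng = random.Random(42)
data = [data_rng.gauss(50.0, 10.0) for _ in range(100_000)]

blocks = make_rsp_blocks(data, num_blocks=100)
true_mean = statistics.fmean(data)

# Estimate the mean from a growing number of selected blocks.
estimates = {g: estimate_mean(blocks, g) for g in (1, 5, 20, 100)}

for g, est in estimates.items():
    print(f"blocks used: {g:3d}  estimate: {est:.4f}  "
          f"abs error: {abs(est - true_mean):.4f}")
```

Because the blocks here have equal size, the estimate from all 100 blocks coincides with the mean of the full data set, and estimates from fewer blocks approximate it without reading the remaining blocks.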


Pages: 28
