Exploring and cleaning big data with random sample data blocks

Cited by: 15

Authors:

Salloum, Salman [1,2]

Huang, Joshua Zhexue [1,2]

He, Yulin [1,2]

Affiliations:

[1] Shenzhen Univ, Coll Comp Sci & Software Engn, Big Data Inst, Shenzhen 518060, Peoples R China

[2] Shenzhen Univ, Natl Engn Lab Big Data Syst Comp Technol, Shenzhen 518060, Peoples R China

Source:

JOURNAL OF BIG DATA | 2019 / Vol. 6 / Issue 01

Funding:

National Key R&D Program of China; China Postdoctoral Science Foundation; National Natural Science Foundation of China;

Keywords:

Big data; Exploratory data analysis; Statistical estimation; Data cleaning; Block-level sampling; Random sample partition; Distributed; Parallel and cluster computing;

DOI:

10.1186/s40537-019-0205-4

Chinese Library Classification (CLC) number:

TP301 [Theory, methods];

Discipline classification code:

081202;

Abstract:

Data scientists need scalable methods to explore and clean big data before applying advanced data analysis and mining algorithms. In this paper, we propose the RSP-Explore method, which enables data scientists to iteratively explore big data on small computing clusters. We address three main tasks: statistical estimation, error detection, and data cleaning. The Random Sample Partition (RSP) distributed data model represents the data as a set of ready-to-use random sample data blocks (called RSP blocks) of the entire data set. Block-level samples of RSP blocks are selected to understand the data, identify potential types of value errors, and obtain samples of clean data. We provide a theoretical analysis of using RSP blocks for statistical estimation and demonstrate empirically the advantages of the RSP-Explore method. Experimental results on three real data sets show that the approximate results from RSP-Explore converge rapidly toward the true values. Furthermore, cleaning a sample of RSP blocks is sufficient to estimate the statistical properties of the unknown clean data.
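
The block-level sampling idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' RSP-Explore implementation: the function names `make_rsp_blocks` and `estimate_mean` are hypothetical, the data are synthetic, and a simple in-memory shuffle-and-split stands in for the distributed RSP data model. It only shows why a few randomly chosen blocks can approximate a statistic of the whole data set.

```python
import random
import statistics


def make_rsp_blocks(records, num_blocks, seed=0):
    """Hypothetical helper: shuffle the records, then deal them into
    num_blocks blocks so that each block is a random sample of the data."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Striding over the shuffled list yields blocks of (nearly) equal size.
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def estimate_mean(blocks, num_selected, seed=0):
    """Hypothetical helper: estimate the overall mean from a block-level
    sample, i.e. from num_selected blocks chosen without replacement."""
    rng = random.Random(seed)
    selected = rng.sample(blocks, num_selected)
    # With equal-sized blocks, averaging the block means equals the mean
    # of the pooled records in the selected blocks.
    return statistics.fmean(statistics.fmean(block) for block in selected)


# Synthetic data (an assumption for illustration only).
data_rng = random.Random(42)
data = [data_rng.gauss(50.0, 10.0) for _ in range(100_000)]

blocks = make_rsp_blocks(data, num_blocks=100)
true_mean = statistics.fmean(data)
approx_mean = estimate_mean(blocks, num_selected=5)

print(f"true mean:   {true_mean:.3f}")
print(f"approx mean: {approx_mean:.3f} (from 5 of {len(blocks)} blocks)")
```

In this sketch only 5% of the records are read to produce the estimate. Increasing `num_selected` tightens the approximation, which loosely mirrors the convergence behavior the abstract reports.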

Pages: 28

