Exploring and cleaning big data with random sample data blocks

Cited by: 15

Authors:

Salloum, Salman [1,2]

Huang, Joshua Zhexue [1,2]

He, Yulin [1,2]

Affiliations:

[1] Shenzhen Univ, Coll Comp Sci & Software Engn, Big Data Inst, Shenzhen 518060, Peoples R China

[2] Shenzhen Univ, Natl Engn Lab Big Data Syst Comp Technol, Shenzhen 518060, Peoples R China

Source:

JOURNAL OF BIG DATA | 2019, Vol. 6, Issue 01

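Funding:

National Key Research and Development Program of China; China Postdoctoral Science Foundation; National Natural Science Foundation of China;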

Keywords:

Big data; Exploratory data analysis; Statistical estimation; Data cleaning; Block-level sampling; Random sample partition; Distributed; Parallel and cluster computing;

DOI:

10.1186/s40537-019-0205-4

Chinese Library Classification:

TP301 [Theory and Methods];

Discipline classification code:

081202 ;

Abstract:

Data scientists need scalable methods to explore and clean big data before applying advanced data analysis and mining algorithms. In this paper, we propose the RSP-Explore method to enable data scientists to iteratively explore big data on small computing clusters. We address three main tasks: statistical estimation, error detection, and data cleaning. The Random Sample Partition (RSP) distributed data model is used to represent the data as a set of ready-to-use random sample data blocks (called RSP blocks) of the entire data. Block-level samples of RSP blocks are selected to understand the data, identify potential types of value errors, and get samples of clean data. We provide a theoretical analysis on using RSP blocks for statistical estimation and demonstrate empirically the advantages of the RSP-Explore method. The experimental results of three real data sets show that the approximate results from RSP-Explore can rapidly converge toward the true values. Furthermore, cleaning a sample of RSP blocks is sufficient to estimate the statistical properties of the unknown clean data.
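The block-level sampling idea in the abstract can be illustrated with a minimal single-machine sketch. This is not the authors' RSP-Explore implementation: the function names `make_rsp_blocks` and `estimate_mean` are hypothetical, and a plain shuffle-and-split stands in for the distributed RSP generation step. It only shows why averaging over a few randomly chosen blocks can approximate a statistic of the full data when every block is itself a random sample.

```python
import random
import statistics


def make_rsp_blocks(values, num_blocks, seed=0):
    """Shuffle the records and deal them into num_blocks blocks.

    Assumption: a local shuffle stands in for building an RSP, so that
    each block is a random sample of the whole data set.
    """
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    # Strided slicing gives blocks whose sizes differ by at most one record.
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def estimate_mean(blocks, num_selected, seed=0):
    """Estimate the overall mean from a block-level sample.

    Selects num_selected whole blocks at random (without replacement)
    and averages their per-block means.
    """
    rng = random.Random(seed)
    selected = rng.sample(blocks, num_selected)
    block_means = [statistics.fmean(block) for block in selected]
    return statistics.fmean(block_means)


# Synthetic data used only for illustration.
data = [float(i % 1000) for i in range(100_000)]
blocks = make_rsp_blocks(data, num_blocks=100)
approx = estimate_mean(blocks, num_selected=10)
exact = statistics.fmean(data)
```

Here `approx` is computed from 10 of the 100 blocks, so only a tenth of the records are read. Increasing `num_selected` moves the estimate toward `exact`, which mirrors the convergence behaviour the abstract reports.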


Pages: 28

