Cluster-based Best Match Scanning for Large-Scale Missing Data Imputation

Cited by: 2

Authors:

Yu, Weiqing [1]

Zhu, Wendong [1]

Liu, Guangyi [1]

Kan, Bowen [1]

Zhao, Ting [2]

Liu, He [2]

Affiliations:

[1] GEIRI North America, San Jose, CA 95134, USA

[2] GEIRI, Beijing, People's Republic of China

Source:

2017 3RD INTERNATIONAL CONFERENCE ON BIG DATA COMPUTING AND COMMUNICATIONS (BIGCOM) | 2017

Keywords:

big data; cluster-based best match scanning; data imputation; k-NN

DOI:

10.1109/BIGCOM.2017.48

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

High-quality data are a prerequisite for analyzing and using big data and for guaranteeing the value of the data. Missing values are a common yet challenging problem in data analytics and data mining, especially in the era of big data, and the amount of missing values directly affects data quality. It is therefore critical to properly recover missing values in a dataset. This paper presents a new imputation algorithm, Cluster-based Best Match Scanning (CBMS), designed for big data. It is a modification of k-NN imputation. CBMS focuses on recovering continuous numeric missing values and aims to balance computational complexity and accuracy. As an imputation algorithm, it can potentially reduce the time complexity of k-NN from [...] to [...] and also reduce space/memory usage, while performing no worse than k-NN imputation. In addition, CBMS is highly parallelizable. A simulation of CBMS is conducted on smart meter reading data. The data are manually divided into a training set and a testing set, and testing accuracy is evaluated by computing the mean absolute deviation. CBMS is compared with linear interpolation and k-NN imputation to demonstrate the effectiveness of the proposed algorithm.
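The abstract does not spell out the steps of CBMS, so the following is only a generic sketch of the idea it names: restrict the k-NN donor search to one cluster instead of scanning the whole dataset, then score imputations by mean absolute deviation. It is not the authors' algorithm. The function names (`kmeans`, `cluster_knn_impute`, `mean_absolute_deviation`), the use of k-means for clustering, the parameters `n_clusters` and `k`, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def kmeans(X, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's k-means on complete rows; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every row to every centroid.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for c in range(n_clusters):
            members = X[labels == c]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    # Recompute labels so they match the final centroids.
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d.argmin(axis=1)


def cluster_knn_impute(X, n_clusters=2, k=3):
    """Fill NaNs using the k nearest complete rows inside the closest cluster.

    Assumption: clustering is done with k-means on the complete rows only.
    """
    X = np.asarray(X, dtype=float)
    complete = ~np.isnan(X).any(axis=1)
    donors = X[complete]
    centroids, labels = kmeans(donors, n_clusters)
    out = X.copy()
    for i in np.flatnonzero(~complete):
        obs = ~np.isnan(X[i])
        # Pick the closest centroid using only the observed features.
        c = ((centroids[:, obs] - X[i, obs]) ** 2).sum(axis=1).argmin()
        pool = donors[labels == c]
        if len(pool) == 0:  # fall back to all donors if the cluster is empty
            pool = donors
        # k-NN search restricted to that cluster's rows.
        dist = ((pool[:, obs] - X[i, obs]) ** 2).sum(axis=1)
        nearest = pool[np.argsort(dist)[:k]]
        out[i, ~obs] = nearest[:, ~obs].mean(axis=0)
    return out


def mean_absolute_deviation(true_vals, imputed_vals):
    """Mean of |true - imputed| over the held-out entries."""
    diff = np.asarray(true_vals, dtype=float) - np.asarray(imputed_vals, dtype=float)
    return float(np.mean(np.abs(diff)))


# Toy data (hypothetical): two well-separated groups plus two incomplete rows.
data = np.array([
    [1.0, 1.0, 1.0], [1.2, 1.1, 0.9], [0.9, 1.0, 1.1], [1.1, 0.9, 1.0],
    [10.0, 10.0, 10.0], [10.2, 10.1, 9.9], [9.9, 10.0, 10.1], [10.1, 9.9, 10.0],
    [1.0, np.nan, 1.0],
    [10.0, 10.0, np.nan],
])
filled = cluster_knn_impute(data, n_clusters=2, k=3)
```

Because each incomplete row is compared only against the rows of one cluster, the neighbor search touches a fraction of the dataset, and rows can be processed independently of each other, which is the general reason a cluster-restricted search is cheaper and easier to parallelize than a full k-NN scan.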


Pages: 232-238

Page count: 7
