Cluster-based Best Match Scanning for Large-Scale Missing Data Imputation

Cited by: 2
Authors
Yu, Weiqing [1]
Zhu, Wendong [1]
Liu, Guangyi [1]
Kan, Bowen [1]
Zhao, Ting [2]
Liu, He [2]
Affiliations
[1] GEIRI North Amer, San Jose, CA 95134 USA
[2] GEIRI, Beijing, Peoples R China
Keywords
big data; cluster-based best match scanning; data imputation; k-NN;
DOI
10.1109/BIGCOM.2017.48
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
High-quality data are a prerequisite for analyzing and using big data and for guaranteeing the value of the data. Missing values are a common yet challenging problem in data analytics and data mining, especially in the era of big data, and the amount of missing values directly affects data quality. It is therefore critical to properly recover the missing values in a dataset. This paper presents a new imputation algorithm for big data called Cluster-based Best Match Scanning (CBMS), a modification of k-NN imputation. CBMS focuses on recovering continuous numeric missing values and aims to balance computational complexity and accuracy. As an imputation algorithm, it can potentially reduce both the time complexity and the space/memory usage of k-NN imputation while performing no worse than k-NN imputation. In addition, CBMS is highly parallelizable. CBMS is simulated on smart meter reading data. The data are manually divided into a training set and a testing set, and testing accuracy is evaluated by computing the mean absolute deviation. A comparison with linear interpolation and k-NN imputation demonstrates the power and effectiveness of the proposed CBMS algorithm.
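The abstract does not spell out the steps of CBMS, so the following is only a rough sketch of the general idea of restricting a k-NN search to one cluster, not the authors' algorithm. It assumes that complete rows are grouped with a small k-means, that an incomplete row is matched to the nearest centroid on its observed dimensions, and that each missing entry is filled with the mean of the k nearest complete rows in that cluster. All function names, the seeding rule, and the toy data are hypothetical; "mean absolute deviation" is assumed here to mean the average absolute difference between true and imputed values.

```python
import math

def dist(a, b, dims):
    """Euclidean distance between a and b restricted to the given dimensions."""
    return math.sqrt(sum((a[d] - b[d]) ** 2 for d in dims))

def assign(rows, centroids):
    """Label each row with the index of its nearest centroid."""
    dims = range(len(rows[0]))
    return [min(range(len(centroids)), key=lambda c: dist(r, centroids[c], dims))
            for r in rows]

def kmeans(rows, n_clusters, n_iter=10):
    """Tiny k-means with evenly spaced seed rows (an assumption, for determinism)."""
    centroids = [list(rows[i * len(rows) // n_clusters]) for i in range(n_clusters)]
    for _ in range(n_iter):
        labels = assign(rows, centroids)
        for c in range(n_clusters):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign(rows, centroids)

def cluster_knn_impute(row, complete_rows, centroids, labels, k=2):
    """Fill None entries of row from the k nearest rows of the best-matching cluster."""
    observed = [d for d, v in enumerate(row) if v is not None]
    missing = [d for d, v in enumerate(row) if v is None]
    # Match the row to a cluster using only the dimensions that are present.
    best = min(range(len(centroids)), key=lambda c: dist(row, centroids[c], observed))
    candidates = [r for r, lab in zip(complete_rows, labels) if lab == best]
    if not candidates:  # fall back to a plain k-NN search over all rows
        candidates = complete_rows
    neighbours = sorted(candidates, key=lambda r: dist(row, r, observed))[:k]
    filled = list(row)
    for d in missing:
        filled[d] = sum(r[d] for r in neighbours) / len(neighbours)
    return filled

def mean_absolute_deviation(truth, estimate):
    """Average absolute difference between true and imputed values."""
    return sum(abs(t - e) for t, e in zip(truth, estimate)) / len(truth)

# Toy data: two well-separated groups of complete rows (made up for illustration).
complete = [[1.0, 1.0, 1.0], [1.2, 0.8, 1.0], [0.8, 1.2, 1.0],
            [10.0, 10.0, 10.0], [10.2, 9.8, 10.0], [9.8, 10.2, 10.0]]
centroids, labels = kmeans(complete, n_clusters=2)
filled = cluster_knn_impute([9.9, None, 10.1], complete, centroids, labels, k=2)
print(filled)
```

Under these assumptions, the neighbour search touches only the rows of one cluster instead of the whole dataset, which is the kind of saving over plain k-NN imputation that the abstract alludes to. Each incomplete row is handled independently, so the imputation step could be run in parallel.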
Pages: 232-238
Number of pages: 7