Optimization of Big Data Cleaning Based on Task Merging

Authors
Yang D.-H. [1 ,2 ]
Li N.-N. [1 ]
Wang H.-Z. [1 ]
Li J.-Z. [1 ]
Gao H. [1 ]
Affiliations
[1] School of Computer Science and Technology, Harbin Institute of Technology, Harbin
[2] Academy of Fundamental and Interdisciplinary Sciences, Harbin Institute of Technology, Harbin
Source
Wang, Hong-Zhi (wangzh@hit.edu.cn) | 2016 / Science Press / Vol. 39
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China
Keywords
Big data; Data cleaning; Hadoop; MapReduce; Massive data; Multi-task optimization
DOI
10.11897/SP.J.1016.2016.00097
Abstract
Data quality issues can have severe consequences for big data applications, so big data with quality problems must be cleaned. The MapReduce programming framework can exploit parallel processing to achieve high scalability for big data cleaning. However, for lack of effective design, MapReduce-based cleaning processes perform redundant computation, which degrades performance. The purpose of this paper is therefore to optimize the parallel data cleaning process to improve its efficiency. We observed that data cleaning tasks often run on the same input file or use the same intermediate results. Based on this observation, this paper presents a new optimization technique based on task merging. By merging redundant computations, and by combining several simple computations over the same input file into one job, the number of MapReduce rounds can be reduced and with it the running time, achieving system-level optimization. Several complex modules of the data cleaning process are optimized in this way: the entity resolution module, the inconsistent data repair module, and the missing value filling module. Experimental results show that the proposed strategies effectively improve the efficiency of data cleaning. © 2016, Science Press. All rights reserved.
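As a rough illustration of the idea (not the authors' implementation, which this record does not reproduce), the following Hadoop MapReduce sketch in Java merges two simple cleaning computations over the same input file into a single round: counting candidate duplicate keys for entity resolution, and counting records with missing fields. The key prefixes "dup:" and "miss:", the toy CSV parsing, and the class names are assumptions made for this sketch.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergedCleaningJob {

  // One mapper serves two cleaning tasks in the same pass over the input,
  // distinguishing them by a task prefix on the intermediate key.
  public static class TaggingMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(",", -1);
      // Task (a): emit a blocking key (lower-cased first field) so the
      // reducer groups records that are candidate duplicates.
      ctx.write(new Text("dup:" + fields[0].trim().toLowerCase()), ONE);
      // Task (b): count records that contain at least one empty field.
      for (String f : fields) {
        if (f.trim().isEmpty()) {
          ctx.write(new Text("miss:records"), ONE);
          break;
        }
      }
    }
  }

  // A single summing reducer handles both key spaces; the prefixes keep
  // the two tasks' results separate in the output.
  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> vals, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : vals) sum += v.get();
      ctx.write(key, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "merged-cleaning");
    job.setJarByClass(MergedCleaningJob.class);
    job.setMapperClass(TaggingMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Run separately, these two tasks would each scan the full input in their own MapReduce round; tagging map outputs with a per-task prefix lets one scan and one shuffle serve both, which is the kind of round reduction the paper targets.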
Pages: 97-108
Number of pages: 11