Incremental Optimization Method for Periodic Query in Data Warehouse

Cited by: 0

Authors:

Kang Y.-L. [1,2]

Li F. [1]

Wang L. [1,2]

Affiliations:

[1] State Key Laboratory of Computer Architecture, Institute of Computing Technology, The Chinese Academy of Sciences, Beijing

[2] University of Chinese Academy of Sciences, Beijing

Source:

Li, Feng (lifeng2005@ict.ac.cn) | Year: 1600 / Volume: Chinese Academy of Sciences / Issue: 28

Funding:

National Natural Science Foundation of China

Keywords:

Data warehouse; Incremental optimize; Middle result reusing; Periodic query;

DOI:

10.13328/j.cnki.jos.005107

Abstract:

Analytical queries are an important way to extract value from big data in a data warehouse. As data grows, the same query must be executed periodically, which inevitably introduces redundant computation over historical data. One class of incremental optimization techniques reduces this redundancy by reusing intermediate results computed over historical data. However, these techniques have the following problems: 1) they are not transparent to the user; 2) the positions at which historical results are stored and reused are not chosen intelligently; and 3) the optimization gains are limited. This article designs an incremental optimization method guided by semantic rules. The method addresses both user transparency and optimization gains, and extends the query grammar to support incremental descriptions. The positions for storing and reusing historical results are first chosen according to the operational semantics and output semantics of the operators. These positions are then adjusted according to a cost model and the division points of the physical tasks. Finally, an optimized task DAG is generated that can run periodically on a distributed computing framework (such as MapReduce). A prototype called HiveInc is implemented on top of Apache Hive. Experimental results on TPC-H show that, compared with no optimization, HiveInc achieves an average speedup of 2.93 and a maximum speedup of 5.78. Compared with the classical optimization techniques IncMR and DryadInc, it achieves speedups of 1.69 and 1.61, respectively. © Copyright 2017, Institute of Software, the Chinese Academy of Sciences. All rights reserved.
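The general idea of reusing intermediate results of historical data for a periodic query can be illustrated with a minimal sketch. This is not HiveInc's implementation: the class name PeriodicSumQuery, the helper partial_sum, and the choice of a SUM ... GROUP BY query are assumptions made only for illustration, and the sketch ignores the position selection, cost model, and task DAG described in the abstract.

```python
def partial_sum(rows):
    """Aggregate one batch of (key, amount) rows: SUM(amount) GROUP BY key."""
    totals = {}
    for key, amount in rows:
        totals[key] = totals.get(key, 0) + amount
    return totals


class PeriodicSumQuery:
    """Hypothetical periodic query that keeps its aggregate over historical data."""

    def __init__(self):
        # Stored intermediate result covering all data seen in earlier runs.
        self.saved = {}

    def run(self, new_rows):
        # Scan only the newly arrived rows; history is reused from self.saved.
        for key, subtotal in partial_sum(new_rows).items():
            self.saved[key] = self.saved.get(key, 0) + subtotal
        return dict(self.saved)


day1 = [("a", 1), ("b", 2)]
day2 = [("a", 3), ("c", 5)]

query = PeriodicSumQuery()
first = query.run(day1)
second = query.run(day2)

# The incremental answer matches a full recomputation over all data.
full = partial_sum(day1 + day2)
```

The sketch works because SUM is decomposable: the aggregate over history plus the aggregate over the increment equals the aggregate over everything. Operators without this property would need a different storing/reusing position, which is the kind of choice the abstract attributes to operator semantics.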

References

Pages: 2126-2147

Page count: 21

20 entries in total

[1] Facebook process more than 500TB data daily, (2012)
[2] Thusoo A., Sarma J.S., Jain N., Shao Z., Chakka P., Hive: A warehousing solution over a map-reduce framework, Proc. of the VLDB Endowment, 2, 2, pp. 1626-1629, (2013)
[3] Dean J., Ghemawat S., MapReduce: Simplified data processing on large clusters, Proc. of the Operating Systems Design and Implementation, 51, 1, pp. 107-113, (2004)
[4] Zaharia M., Chowdhury M., Das T., Dave A., Ma J., Mccauley M., Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, Proc. of the 9th USENIX Conf. on Networked Systems Design and Implementation, 70, (2012)
[5] Chattopadhyay B., Lin L., Liu W., Mittal S., Aragonda P., Lychagina V., Tenzing: A SQL implementation on the MapReduce framework, Proc. of the VLDB Endowment, 4, 12, pp. 1318-1327, (2011)
[6] Peng D., Dabek F., Large-Scale incremental processing using distributed transactions and notifications, Proc. of the 9th USENIX Conf. on Operating Systems Design and Implementation, pp. 1-15, (2010)
[7] Logothetis D., Olston C., Reed B., Webb K.C., Yocum K., Stateful bulk processing for incremental analytics, Proc. of the 1st ACM Symp. on Cloud Computing, pp. 51-62, (2010)
[8] Yan C., Yang X., Yu Z., Li M., Li X., IncMR: Incremental data processing based on MapReduce, Proc. of the 2012 IEEE 5th Int'l Conf. on Cloud Computing, pp. 534-541, (2012)
[9] Bhatotia P., Wieder A., Rodrigues R., Acar U.A., Pasquin R., Incoop: MapReduce for incremental computations, Proc. of the 2nd ACM Symp. on Cloud Computing, pp. 1-14, (2011)
[10] Popa L., Budiu M., Yu Y., Isard M., DryadInc: Reusing work in large-scale computations, Proc. of the 2009 Conf. on Hot Topics in Cloud Computing, (2009)
