Automatic checkpoint strategy for the Spark parallel computing framework

Cited by: 0
Authors
Ying C. [1 ]
Yu J. [1 ,2 ]
Bian C. [1 ]
Lu L. [1 ]
Qian Y. [2 ]
Affiliations
[1] School of Information Science and Engineering, Xinjiang University, Urumqi
[2] School of Software, Xinjiang University, Urumqi
Source
Corresponding author: Yu, Jiong (yujiong@xju.edu.cn) | Journal of Southeast University, Vol. 47
Keywords
Automatic checkpoint; Recovery time; Resilient distributed dataset (RDD) weight; Spark
DOI
10.3969/j.issn.1001-0505.2017.02.006
Abstract
The existing Spark checkpoint mechanism requires the programmer to select checkpoints based on experience, which introduces risk and randomness and can result in large recovery overhead. To address this problem, the characteristics of resilient distributed datasets (RDDs) were analyzed, and a weight generation (WG) algorithm and a checkpoint automatic selection (CAS) algorithm were proposed. First, the WG algorithm analyzes the directed acyclic graph (DAG) of the job and derives the lineage length and operation complexity of each RDD to compute its weight. Second, the CAS algorithm selects the RDDs with the maximum weights and checkpoints them asynchronously to speed up recovery. The experimental results show that, compared with the original Spark, the CAS algorithm increases the execution time and checkpoint size on all tested datasets, with the increase being most pronounced on Wiki-Talk. For single-node failure recovery, the datasets checkpointed by the CAS algorithm incur smaller recovery overhead. Therefore, the strategy efficiently reduces the recovery overhead of jobs at the cost of a slight extra overhead. © 2017, Editorial Department of Journal of Southeast University. All rights reserved.
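The abstract describes the core of the WG/CAS approach: weight each RDD by combining its lineage length with the complexity of the operations that produced it, then checkpoint the maximum-weight RDD. The following is a minimal sketch of that idea only; the `OP_COMPLEXITY` scores, function names, and DAG representation are illustrative assumptions, not the paper's actual weight formula or implementation:

```python
# Illustrative sketch of the WG/CAS idea (not the paper's code):
# weight(RDD) = lineage length x total operation complexity,
# and CAS checkpoints the RDD with the maximum weight.

# Hypothetical complexity scores for common Spark transformations.
OP_COMPLEXITY = {"map": 1, "filter": 1, "groupByKey": 3, "join": 4}

def rdd_weight(lineage):
    """WG sketch: lineage is the list of operation names from the
    root RDD to this RDD; longer, costlier lineages get higher weight."""
    return len(lineage) * sum(OP_COMPLEXITY.get(op, 1) for op in lineage)

def select_checkpoint(dag):
    """CAS sketch: dag maps RDD name -> its lineage (list of ops).
    Returns the name of the maximum-weight RDD to checkpoint."""
    return max(dag, key=lambda rdd: rdd_weight(dag[rdd]))

dag = {
    "rdd_a": ["map"],
    "rdd_b": ["map", "filter"],
    "rdd_c": ["map", "filter", "join"],
}
print(select_checkpoint(dag))  # rdd_c: longest, costliest lineage
```

In actual Spark code the selected RDD would then be persisted via the standard `rdd.checkpoint()` API (after `SparkContext.setCheckpointDir`), which is the mechanism the paper's asynchronous strategy builds on.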
Pages: 231-235
Page count: 4
References
10 records
  • [1] Meng X., Ci X., Big data management: Concepts, techniques and challenges, Journal of Computer Research and Development, 50, 1, pp. 146-169, (2013)
  • [2] Wang Y., Jin X., Cheng X., Network big data: Present and future, Chinese Journal of Computers, 36, 6, pp. 1125-1138, (2013)
  • [3] Zaharia M., Chowdhury M., Franklin M.J., et al., Spark: Cluster computing with working sets, USENIX Conference on Hot Topics in Cloud Computing, pp. 1765-1773, (2010)
  • [4] Zaharia M., Chowdhury M., Das T., et al., Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, USENIX Conference on Networked Systems Design and Implementation, pp. 141-146, (2012)
  • [5] Yi H., Wang F., Zuo K., et al., Asynchronous checkpoint/restart based on memory buffer, Journal of Computer Research and Development, 51, 6, pp. 1229-1239, (2014)
  • [6] Ci Y., Zhang Z., Zuo D., et al., Scalable time-based multi-cycle checkpointing, Journal of Software, 21, 2, pp. 218-230, (2010)
  • [7] Wu J., Fault-tolerant scheduling algorithm for heterogeneous distributed control systems based on dual priorities queues, Journal of Southeast University (Natural Science Edition), 38, 3, pp. 407-412, (2008)
  • [8] Neumeyer L., Robbins B., Nair A., et al., S4: Distributed stream computing platform, IEEE International Conference on Data Mining Workshops, pp. 170-177, (2010)
  • [9] Ongaro D., Rumble S.M., Stutsman R., et al., Fast crash recovery in RAMCloud, ACM Symposium on Operating Systems Principles, pp. 29-41, (2011)
  • [10] Li H.Y., Ghodsi A., Zaharia M., et al., Tachyon: Reliable, memory speed storage for cluster computing frameworks, ACM Symposium on Cloud Computing, pp. 1-15, (2014)