Algorithm-Based Checkpoint-Recovery for the Conjugate Gradient Method

Cited by: 0
Authors
Pachajoa, Carlos [1]
Pacher, Christina [1]
Levonyak, Markus [1]
Gansterer, Wilfried N. [1]
Affiliations
[1] Univ Vienna, Fac Comp Sci, Vienna, Austria
Keywords
ITERATIVE METHODS;
DOI
10.1145/3404397.3404438
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
As computers reach exascale and beyond, the incidence of faults will increase. Solutions to this problem are an active research topic. We focus on strategies to make the preconditioned conjugate gradient (PCG) solver resilient against node failures, specifically the exact state reconstruction (ESR) method, which exploits redundancies in PCG. Reducing the frequency at which redundant information is stored lessens the runtime overhead. However, after a node failure, the solver must restart from the last iteration for which redundant information was stored, which increases recovery overhead. This formulation highlights the method's similarities to checkpoint-restart (CR). Thus, this method, which we call ESR with periodic storage (ESRP), can be considered a form of algorithm-based checkpoint-restart. The state is stored implicitly, by exploiting redundancy inherent to the algorithm, rather than explicitly as in CR. We also minimize the amount of data to be stored and retrieved compared to CR, but additional computation is required to reconstruct the solver's state. In this paper, we describe the modifications necessary to convert ESR into ESRP and perform an experimental evaluation. We compare ESRP experimentally with previously existing ESR and application-level in-memory CR. Our results confirm that the overhead for ESR is reduced significantly, both in the failure-free case and when node failures are introduced. In the former case, the overhead of ESRP is usually lower than that of CR. However, CR is faster if node failures happen. We claim that these differences can be alleviated by the implementation of more appropriate preconditioners.
Pages: 11
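
The trade-off named in the abstract (storing state less often lowers failure-free overhead, but a failure then forces more iterations to be redone) can be illustrated with a minimal sketch. This is not the authors' ESRP: it uses plain, unpreconditioned CG on a single process and keeps explicit in-memory copies of the solver state, which is closer to the CR baseline than to implicit state reconstruction. The names `cg_periodic_checkpoint`, `interval`, and `fail_at` are hypothetical and chosen only for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def cg_periodic_checkpoint(A, b, interval, fail_at=None, tol=1e-10, max_iter=1000):
    """Plain CG that snapshots its state every `interval` iterations.

    If `fail_at` is given, a single simulated failure occurs when the
    iteration counter first reaches that value, and the solver rolls back
    to the most recent snapshot.
    Returns (x, executed_iterations, redone_iterations).
    """
    x = [0.0] * len(b)
    r = list(b)              # residual for the initial guess x = 0
    p = list(r)              # initial search direction
    rs = dot(r, r)
    k = 0                    # current iteration index
    executed = 0             # iterations actually computed, including redone ones
    redone = 0               # iterations lost to the rollback
    snapshot = (k, list(x), list(r), list(p), rs)
    failed = False

    while k < max_iter and rs ** 0.5 > tol:
        if fail_at is not None and not failed and k == fail_at:
            # Simulated failure: discard current state, restore the snapshot.
            failed = True
            redone = k - snapshot[0]
            k, rs = snapshot[0], snapshot[4]
            x, r, p = list(snapshot[1]), list(snapshot[2]), list(snapshot[3])
            continue

        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs = rs_new
        k += 1
        executed += 1

        if k % interval == 0:
            # Periodic storage: a larger interval means fewer copies but
            # potentially more redone iterations after a failure.
            snapshot = (k, list(x), list(r), list(p), rs)

    return x, executed, redone
```

In this sketch, a failure at iteration 3 with `interval=2` rolls back to the snapshot taken at iteration 2 and repeats one iteration, whereas `interval=1` repeats none at the price of copying the state after every iteration.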