Rebound: Scalable Checkpointing for Coherent Shared Memory

Authors
Agarwal, Rishi [1]
Garg, Pranav [1]
Torrellas, Josep [1]
Affiliations
[1] Univ Illinois, Urbana, IL 61801 USA
Keywords
Scalable Checkpointing; Shared-Memory Multiprocessors; Faults;
DOI
Not available
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812 ;
Abstract
As we move to large manycores, the hardware-based global checkpointing schemes that have been proposed for small shared-memory machines do not scale. Scalability barriers include global operations, work lost to global rollback, and inefficiencies in imbalanced or I/O-intensive loads. Scalable checkpointing requires tracking inter-thread dependences and building the checkpoint and rollback operations around dynamic groups of communicating processors. To address this problem, this paper introduces Rebound, the first hardware-based scheme for coordinated local checkpointing in multiprocessors with directory-based cache coherence. Rebound leverages the transactions of a directory protocol to track inter-thread dependences. In addition, it boosts checkpointing efficiency by: (i) delaying the writeback of data to safe memory at checkpoints, (ii) supporting operation with multiple checkpoints, and (iii) optimizing checkpointing at barrier synchronization. Finally, Rebound introduces distributed algorithms for checkpointing and rollback sets of processors. Simulations of parallel programs with up to 64 threads show that Rebound is scalable and has very low overhead. For 64 processors, its average performance overhead is only 2%, compared to 15% for global checkpointing.
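
The idea of building checkpoints around dynamic groups of communicating processors can be illustrated with a small software model. The sketch below is not Rebound's hardware mechanism or its distributed algorithms; it is a simplified illustration that assumes dependences are recorded whenever a processor reads or overwrites a line last written by another processor, and that treats those dependences as undirected. The class `DependenceTracker` and its methods are hypothetical names introduced here.

```python
from collections import defaultdict, deque


class DependenceTracker:
    """Toy model: record which processors communicated since their last checkpoint.

    Assumption: a dependence is created when a processor reads or overwrites
    a line last written by another processor. Dependences are kept undirected
    here, which is a simplification made for this sketch.
    """

    def __init__(self):
        self.last_writer = {}              # line address -> processor id
        self.partners = defaultdict(set)   # processor id -> communication partners

    def _link(self, a, b):
        if a != b:
            self.partners[a].add(b)
            self.partners[b].add(a)

    def write(self, proc, addr):
        prev = self.last_writer.get(addr)
        if prev is not None:
            self._link(prev, proc)
        self.last_writer[addr] = proc

    def read(self, proc, addr):
        prev = self.last_writer.get(addr)
        if prev is not None:
            self._link(prev, proc)

    def checkpoint_set(self, initiator):
        """Return the initiator plus every processor transitively linked to it."""
        group = {initiator}
        queue = deque([initiator])
        while queue:
            p = queue.popleft()
            for q in self.partners.get(p, ()):
                if q not in group:
                    group.add(q)
                    queue.append(q)
        return group

    def checkpoint(self, initiator):
        """Checkpoint only the initiator's group and forget its old dependences."""
        group = self.checkpoint_set(initiator)
        for p in group:
            self.partners.pop(p, None)
        # Data written by the group before the checkpoint no longer creates dependences.
        self.last_writer = {a: w for a, w in self.last_writer.items() if w not in group}
        return group


tracker = DependenceTracker()
tracker.write(0, 0x10)
tracker.read(1, 0x10)    # processors 0 and 1 now form one group
tracker.write(2, 0x20)
tracker.read(3, 0x20)    # processors 2 and 3 form another group
print(sorted(tracker.checkpoint(0)))
```

In this model a checkpoint started by processor 0 involves only processors 0 and 1, while processors 2 and 3 keep running undisturbed, which is the contrast with a global checkpoint that the abstract draws.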
Pages: 153-164
Page count: 12