Checkpointing distributed shared memory

Cited by: 0
Authors
Silva, LM
Silva, JG
Affiliation
[1] Departamento Engenharia Informática, Universidade de Coimbra, P-3030 - Coimbra, POLO II - Vila Franca
Source
JOURNAL OF SUPERCOMPUTING | 1997 / Volume 11 / Issue 02
Keywords
distributed shared memory; checkpointing; fault-tolerance; portability;
DOI
10.1023/A:1007959906858
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812 ;
Abstract
Distributed shared memory (DSM) is a very promising programming model for exploiting the parallelism of distributed memory systems because it provides a higher level of abstraction than simple message passing. Although the nodes of standard distributed systems exhibit high crash rates, very few DSM environments provide any support for fault tolerance. In this article, we present a checkpointing mechanism for a DSM system that is efficient and portable. It offers some portability because it is built on top of MPI and uses only the services offered by MPI and a POSIX-compliant local file system. As far as we know, this is the first real implementation of such a scheme for DSM. Along with the description of the algorithm, we present experimental results obtained on a cluster of workstations. We hope that our research shows that efficient, transparent, and portable checkpointing is viable for DSM systems.
Pages: 137 - 158
Number of pages: 22
Related papers
50 in total
  • [1] Checkpointing Distributed Shared Memory
    Luis M. Silva
    João Gabriel Silva
    [J]. The Journal of Supercomputing, 1997, 11 : 137 - 158
  • [2] Checkpointing speculative distributed shared memory
    Danilecki, Arkadiusz
    Kobusinska, Anna
    Szychowiak, Michal
    [J]. PARALLEL PROCESSING AND APPLIED MATHEMATICS, 2006, 3911 : 9 - 16
  • [3] Portable transparent checkpointing for distributed shared memory
    Silva, LM
    Silva, JG
    Chapple, S
    [J]. PROCEEDINGS OF THE FIFTH IEEE INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE DISTRIBUTED COMPUTING, 1996, : 422 - 431
  • [4] A checkpointing algorithm for an SCI based distributed shared memory system
    Kalaiselvi, S
    Rajaraman, V
    [J]. MICROPROCESSORS AND MICROSYSTEMS, 1999, 22 (09) : 515 - 522
  • [5] Rebound: Scalable Checkpointing for Coherent Shared Memory
    Agarwal, Rishi
    Garg, Pranav
    Torrellas, Josep
    [J]. ISCA 2011: PROCEEDINGS OF THE 38TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, 2011, : 153 - 164
  • [6] Checkpointing and recovery of shared memory parallel applications in a cluster
    Badrinath, R
    Morin, C
    Vallée, G
    [J]. CCGRID 2003: 3RD IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER COMPUTING AND THE GRID, PROCEEDINGS, 2003, : 471 - 478
  • [7] Application-level checkpointing for shared memory programs
    Bronevetsky, G
    Marques, D
    Pingali, K
    Szwed, P
    Schulz, M
    [J]. ACM SIGPLAN NOTICES, 2004, 39 (11) : 235 - 247
  • [8] Distributed shared memory integration
    Geva, Mordechai
    Wiseman, Yair
    [J]. IRI 2007: PROCEEDINGS OF THE 2007 IEEE INTERNATIONAL CONFERENCE ON INFORMATION REUSE AND INTEGRATION, 2007, : 146 - +
  • [9] Heterogeneous distributed shared memory
    Zhou, SN
    Stumm, M
    Li, K
    Wortman, D
    [J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 1992, 3 (05) : 540 - 554
  • [10] Programming with distributed shared memory
    Ramachandran, U
    Khalidi, MYA
    [J]. PROCEEDINGS : THE THIRTEENTH ANNUAL INTERNATIONAL COMPUTER SOFTWARE & APPLICATIONS CONFERENCE, 1989, : 176 - 183