Fault-tolerant distributed simulation

Cited by: 14
Authors
Damani, OP [1 ]
Garg, VK [1 ]
Affiliation
[1] Univ Texas, Dept Comp Sci, Austin, TX 78712 USA
DOI
10.1109/PADS.1998.685268
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
In traditional distributed simulation schemes, the entire simulation needs to be restarted if any of the participating logical processes (LPs) crashes. This is highly undesirable for long-running simulations. Some form of fault tolerance is required to minimize the wasted computation. In this paper, a rollback-based optimistic fault-tolerance scheme is integrated with an optimistic distributed simulation scheme. In rollback recovery schemes, checkpoints are periodically saved on stable storage. After a crash, these saved checkpoints are used to restart the computation. We make use of the novel insight that a failure can be modeled as a straggler event with the receive time equal to the virtual time of the last checkpoint saved on stable storage. This results in savings in implementation effort, as well as reduced overheads. We define stable global virtual time (SGVT) as the virtual time such that no state with a lower timestamp will ever be rolled back despite crash failures. A simple change is made in existing GVT algorithms to compute SGVT. Our use of transitive dependency tracking eliminates antimessages. LPs are grouped into clusters to minimize stable storage access time.
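The central idea of the abstract, treating a crash as a straggler whose receive time equals the virtual time of the last stable checkpoint, can be illustrated with a minimal sketch. This is not the authors' implementation: the class `LogicalProcess`, its methods, and the function `stable_gvt` are hypothetical names, the state is a toy running sum, re-execution of rolled-back events is omitted, and the SGVT rule shown (taking the minimum over stable checkpoint times and in-transit message timestamps) is an assumption inferred from the definition in the abstract.

```python
class LogicalProcess:
    """Toy optimistic LP (hypothetical; for illustration only)."""

    def __init__(self):
        self.lvt = 0             # local virtual time
        self.state = 0           # toy state: running sum of payloads
        self.volatile = []       # (virtual_time, state) checkpoints in memory
        self.stable = [(0, 0)]   # (virtual_time, state) checkpoints on stable storage

    def execute(self, recv_time, payload):
        # An event with a receive time below the LVT is a straggler.
        if recv_time < self.lvt:
            self.rollback(recv_time)
        self.lvt = recv_time
        self.state += payload
        self.volatile.append((self.lvt, self.state))

    def flush(self):
        # Copy the most recent in-memory checkpoint to stable storage.
        if self.volatile:
            self.stable.append(self.volatile[-1])

    def rollback(self, recv_time):
        # Restore the latest checkpoint whose virtual time is <= recv_time
        # and discard every later checkpoint. Re-executing the undone
        # events is left out of this sketch.
        candidates = [c for c in self.stable + self.volatile if c[0] <= recv_time]
        self.lvt, self.state = max(candidates, key=lambda c: c[0])
        self.volatile = [c for c in self.volatile if c[0] <= self.lvt]
        self.stable = [c for c in self.stable if c[0] <= self.lvt]

    def crash_and_recover(self):
        # A crash loses volatile memory. Recovery reuses the straggler path:
        # the "straggler" receive time is the last stable checkpoint's time.
        self.volatile.clear()
        self.rollback(self.stable[-1][0])


def stable_gvt(lps, in_transit_times):
    # Assumed rule: each LP reports its last stable checkpoint time
    # in place of its LVT, and the usual GVT minimum is taken.
    return min([lp.stable[-1][0] for lp in lps] + list(in_transit_times))
```

In this sketch no separate recovery routine is needed: `crash_and_recover` only clears volatile memory and then calls the same `rollback` used for ordinary stragglers, which is one way to read the abstract's claim of saved implementation effort.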
Pages: 38-45
Page count: 8