Communication Pattern-based Distributed Snapshots in Large-Scale Systems

Cited by: 1
Authors
Saker, Salem [1]
Agbaria, Adnan [2]
Affiliations
[1] Univ Haifa, Acad Arab Coll Educ Israel, IL-31999 Haifa, Israel
[2] Univ Haifa, IL-31999 Haifa, Israel
Keywords
ROLLBACK-RECOVERY;
DOI
10.1109/IPDPSW.2015.117
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
Large-scale systems (LSSs) continue to attract attention from the scientific community as platforms for high-performance computing. Providing fault tolerance in distributed systems is a challenge, and it becomes more difficult in LSSs. Distributed snapshots are an important building block for distributed systems and, among other applications, are useful for providing fault tolerance. This paper motivates the need for fault tolerance in LSSs and focuses on the limitations of providing it. It then presents an innovative and scalable distributed snapshot approach for LSSs. In this approach, upon a new snapshot, a process coordinates only with the processes that it has communicated with since the last snapshot. Our protocol improves on the Chandy and Lamport distributed snapshot protocol, which was presented in 1985. This improvement may enable developers and planners of systems to consider this protocol. We compare the performance of our new approach with that of other well-known distributed snapshot approaches using stochastic models. The results show that our approach achieves significantly lower overhead.
Pages: 1062-1071
Page count: 10