Communication Pattern-based Distributed Snapshots in Large-Scale Systems

Cited by: 1
Authors
Saker, Salem [1]
Agbaria, Adnan [2]
Affiliations
[1] Univ Haifa, Acad Arab Coll Educ Israel, IL-31999 Haifa, Israel
[2] Univ Haifa, IL-31999 Haifa, Israel
Keywords
ROLLBACK-RECOVERY
DOI
10.1109/IPDPSW.2015.117
CLC classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Large-scale systems (LSSs) continue to attract growing attention from the scientific community as platforms for high-performance computing. Providing fault tolerance in distributed systems is a challenge, and it becomes more difficult in LSSs. Distributed snapshots are an important building block for distributed systems and, among other applications, are useful for providing fault tolerance. This paper motivates the need for fault tolerance in LSSs and focuses on the limitations of providing it. It then presents an innovative and scalable distributed snapshot approach for LSSs. In this approach, when a new snapshot is taken, a process coordinates only with the processes it has communicated with since the last snapshot. Our protocol improves on the Chandy and Lamport distributed snapshot protocol, presented in 1985, which may make it a practical option for system developers and planners. We compare the performance of our approach with that of other well-known distributed snapshot approaches using stochastic models. The results show that our approach achieves significantly lower overhead.
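The coordination rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' protocol: the class and function names are hypothetical, messages are modeled as simple function calls, and the assumption that coordination spreads transitively through peers contacted since the last snapshot is ours, not something the abstract states.

```python
class Process:
    """Hypothetical process that remembers who it has talked to."""

    def __init__(self, pid):
        self.pid = pid
        # Peers this process exchanged messages with since its last snapshot.
        self.peers_since_snapshot = set()


def send(procs, src, dst):
    """Model one application message; both endpoints record the contact."""
    procs[src].peers_since_snapshot.add(dst)
    procs[dst].peers_since_snapshot.add(src)


def coordination_set(procs, initiator):
    """Processes reachable from the initiator via recorded contacts.

    Assumption: each contacted process in turn coordinates with its own
    recent peers, so the set is a transitive closure.
    """
    seen = {initiator}
    stack = [initiator]
    while stack:
        pid = stack.pop()
        for peer in procs[pid].peers_since_snapshot:
            if peer not in seen:
                seen.add(peer)
                stack.append(peer)
    return seen


def take_snapshot(procs, initiator):
    """Snapshot only the coordination set, then reset its contact records."""
    participants = coordination_set(procs, initiator)
    for pid in participants:
        procs[pid].peers_since_snapshot.clear()
    return participants


if __name__ == "__main__":
    procs = {i: Process(i) for i in range(5)}
    send(procs, 0, 1)
    send(procs, 1, 2)
    send(procs, 3, 4)
    # Processes 3 and 4 never communicated with 0, 1, or 2, so they are left out.
    print(sorted(take_snapshot(procs, 0)))
```

In this sketch, processes outside the initiator's communication group are not contacted at all, which is the source of the reduced coordination overhead the abstract claims relative to involving every process.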
Pages: 1062-1071
Page count: 10