Communication Pattern-based Distributed Snapshots in Large-Scale Systems

Cited by: 1
Authors
Saker, Salem [1]
Agbaria, Adnan [2]
Affiliations
[1] Univ Haifa, Acad Arab Coll Educ Israel, IL-31999 Haifa, Israel
[2] Univ Haifa, IL-31999 Haifa, Israel
Keywords
ROLLBACK-RECOVERY;
DOI
10.1109/IPDPSW.2015.117
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Large-scale systems (LSSs) continue to attract attention from the scientific community as platforms for high-performance computing. Providing fault tolerance in distributed systems is a challenge, and it becomes more difficult in LSSs. Distributed snapshots are an important building block for distributed systems and, among other applications, are useful for providing fault tolerance. This paper motivates the need for fault tolerance in LSSs and examines the limitations that stand in the way of providing it. It then presents a scalable distributed snapshot approach for LSSs in which, when a new snapshot is taken, a process coordinates only with the processes it has communicated with since the last snapshot. The protocol improves on the Chandy and Lamport distributed snapshot protocol, which was presented in 1985. This improvement may enable system developers and planners to consider the protocol. We use stochastic models to compare the performance of the new approach with that of other well-known distributed snapshot approaches. The results show that our approach incurs significantly lower overhead.
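The coordination rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' protocol: the class name `CommunicationTracker`, its methods, and the choice to follow communication links transitively from the initiator are all assumptions made for illustration. The abstract states only that a process coordinates with the processes it has communicated with since the last snapshot.

```python
from collections import defaultdict


class CommunicationTracker:
    """Hypothetical helper: records who exchanged messages since the last snapshot."""

    def __init__(self):
        # peers[p] holds every process p has sent to or received from
        # since p last took part in a snapshot.
        self.peers = defaultdict(set)

    def record_message(self, sender, receiver):
        # Communication is tracked symmetrically for both endpoints.
        self.peers[sender].add(receiver)
        self.peers[receiver].add(sender)

    def participants(self, initiator):
        # Assumption: coordination spreads along communication links,
        # so the group is everything reachable from the initiator.
        seen = {initiator}
        stack = [initiator]
        while stack:
            current = stack.pop()
            for peer in self.peers[current]:
                if peer not in seen:
                    seen.add(peer)
                    stack.append(peer)
        return seen

    def complete_snapshot(self, initiator):
        # After the snapshot, the participants start with an empty history.
        group = self.participants(initiator)
        for process in group:
            self.peers.pop(process, None)
        return group


tracker = CommunicationTracker()
tracker.record_message("p1", "p2")
tracker.record_message("p2", "p3")
tracker.record_message("p4", "p5")
group = tracker.complete_snapshot("p1")  # p4 and p5 are not involved
```

In this sketch, processes that have not exchanged messages with the initiator's group, directly or indirectly, take no part in the snapshot. That is the property the abstract credits for the lower overhead compared with coordinating all processes.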
Pages: 1062-1071
Number of pages: 10