Communication Pattern-based Distributed Snapshots in Large-Scale Systems

Cited by: 1
Authors
Saker, Salem [1]
Agbaria, Adnan [2]
Affiliations
[1] Univ Haifa, Acad Arab Coll Educ Israel, IL-31999 Haifa, Israel
[2] Univ Haifa, IL-31999 Haifa, Israel
Keywords
ROLLBACK-RECOVERY
DOI
10.1109/IPDPSW.2015.117
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Large-scale systems (LSSs) continue to attract attention from the scientific community for high-performance computing. Providing fault tolerance in distributed systems is a challenge, and it becomes more difficult in LSSs. Distributed snapshots are an important building block for distributed systems and, among other applications, are useful for providing fault tolerance. This paper motivates the need for fault tolerance in LSSs and focuses on the limitations that constrain it. It then presents an innovative and scalable distributed snapshot approach for LSSs. In this approach, when a new snapshot is taken, a process coordinates only with the processes that it has communicated with since the last snapshot. The protocol improves on the Chandy and Lamport distributed snapshot protocol, which was presented in 1985. This improvement may make the protocol a practical option for system developers and planners. We compare the performance of the new approach with that of other well-known distributed snapshot approaches using stochastic models. The results show that our approach achieves significantly lower overhead.
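The coordination rule in the abstract (a process coordinates only with processes it has communicated with since the last snapshot) can be illustrated with a minimal sketch. The abstract does not give the protocol's details, so everything here is an assumption for illustration: the `Process` class, the `snapshot_group` function, the symmetric recording of peers, and the choice to spread coordination transitively through recorded peers are hypothetical and are not taken from the paper.

```python
class Process:
    """Hypothetical process that remembers its peers since the last snapshot."""

    def __init__(self, pid):
        self.pid = pid
        self.peers = set()  # pids communicated with since the last snapshot

    def send(self, other):
        # Assumption: a message makes sender and receiver depend on each other,
        # so both ends record the exchange.
        self.peers.add(other.pid)
        other.peers.add(self.pid)


def snapshot_group(initiator, processes):
    """Return the pids that take part in a snapshot started by `initiator`.

    Assumption: coordination spreads transitively through recorded peers,
    so the group is the set of processes reachable from the initiator.
    """
    group = {initiator}
    frontier = [initiator]
    while frontier:
        pid = frontier.pop()
        for peer in processes[pid].peers:
            if peer not in group:
                group.add(peer)
                frontier.append(peer)
    # Participants begin a new interval with no recorded communication.
    for pid in group:
        processes[pid].peers.clear()
    return group


if __name__ == "__main__":
    procs = {i: Process(i) for i in range(6)}
    procs[0].send(procs[1])
    procs[1].send(procs[2])
    procs[4].send(procs[5])
    # Processes 3, 4 and 5 never exchanged messages with 0, 1 or 2.
    print(sorted(snapshot_group(0, procs)))
```

In this sketch the processes outside the group are not contacted at all, which is the intuition behind the lower overhead claimed in the abstract; a global protocol would involve all six processes.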
Pages: 1062-1071
Page count: 10