Checkpointing Algorithms for Fault-Tolerant Execution of Large-Scale Distributed Applications in Cloud

Cited by: 3
Authors
Kumari, Priti [1]
Kaur, Parmeet [1]
Affiliations
[1] Jaypee Inst Informat Technol, Dept CSE IT, Noida, India
Keywords
Cloud computing; Fault tolerance; Checkpointing; Message logging; Rollback recovery; BoT application; Distributed application; MODEL;
DOI
10.1007/s11277-020-07949-0
Chinese Library Classification (CLC)
TN [Electronic technology, communication technology]
Discipline classification code
0809
Abstract
Cloud computing provides virtually unlimited resources and a suitable environment for executing large-scale computing applications. However, it is also susceptible to frequent failures, which can adversely affect users as well as service providers. Fault tolerance techniques are therefore necessary for the reliable execution of applications in the cloud. This work presents checkpointing-based fault tolerance protocols for two types of distributed applications. The first type is Bag-of-Tasks (BoT) applications, in which an application comprises a set of independent tasks that do not communicate with each other during execution. An uncoordinated checkpointing algorithm is therefore proposed for the fault tolerance of BoT applications. The second type is large-scale distributed applications composed of multiple tasks that depend on each other through inter-task message passing. An uncoordinated checkpointing and message logging protocol is presented for this type of application. The proposed protocols use storage at edge switches in a data center to reduce the bandwidth consumed in saving checkpoints and message logs. Simulation results show that the proposed protocols provide a higher rate of successful recovery from failures and incur lower resource overhead than other contemporary and related schemes.
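As a rough illustration of the general idea behind uncoordinated checkpointing with message logging and rollback recovery, the following Python sketch may help. It is not the protocol from the paper: the `Task` class, its method names, and the toy "sum the received integers" workload are all hypothetical, and the in-memory `checkpoint` and `message_log` attributes merely stand in for stable storage (which the abstract places at edge switches).

```python
import copy


class Task:
    """Toy task that sums the integers it receives (hypothetical example)."""

    def __init__(self, name):
        self.name = name
        self.state = {"total": 0, "processed": 0}
        # Assumption: these two attributes stand in for stable storage
        # that survives a crash of the task itself.
        self.checkpoint = copy.deepcopy(self.state)
        self.message_log = []  # messages delivered since the last checkpoint

    def deliver(self, value):
        # Log the message before processing so it can be replayed later.
        self.message_log.append(value)
        self._apply(value)

    def _apply(self, value):
        self.state["total"] += value
        self.state["processed"] += 1

    def take_checkpoint(self):
        # Uncoordinated: the task saves its state on its own schedule,
        # without synchronizing with any other task.
        self.checkpoint = copy.deepcopy(self.state)
        self.message_log.clear()

    def fail(self):
        # Simulate a crash that wipes the volatile state.
        self.state = None

    def recover(self):
        # Roll back to the last checkpoint, then replay logged messages in order.
        self.state = copy.deepcopy(self.checkpoint)
        for value in self.message_log:
            self._apply(value)


task = Task("t1")
task.deliver(5)
task.take_checkpoint()
task.deliver(7)   # logged, but not covered by the checkpoint
task.fail()
task.recover()
print(task.state)  # {'total': 12, 'processed': 2}
```

For a BoT-style task that receives no messages, the log stays empty and recovery reduces to restoring the last checkpoint; the replay step only matters when tasks depend on inter-task messages.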
Pages: 1853 - 1877
Page count: 25
Related papers
50 in total
  • [1] Checkpointing Algorithms for Fault-Tolerant Execution of Large-Scale Distributed Applications in Cloud
    Priti Kumari
    Parmeet Kaur
    [J]. Wireless Personal Communications, 2021, 117 : 1853 - 1877
  • [2] A fault-tolerant architecture for large-scale distributed control systems
    Hilmer, H
    Kochs, HD
    Dittmar, E
    [J]. DISTRIBUTED COMPUTER CONTROL SYSTEMS 1997 (DCCS'97), 1997, : 39 - 44
  • [3] Distributed Fault-Tolerant Control for A Large-Scale Power Generator Network
    Feng, Zhi
    Hu, Guoqiang
    [J]. 2015 AMERICAN CONTROL CONFERENCE (ACC), 2015, : 5521 - 5526
  • [4] A Probabilistic Fault-Tolerant Recovery Mechanism for Task and Result Certification of Large-Scale Distributed Applications
    Chayeh, Rim
    Cerin, Christophe
    Jemni, Mohamed
    [J]. ADVANCES IN GRID AND PERVASIVE COMPUTING, PROCEEDINGS, 2009, 5529 : 471 - +
  • [5] Distributed Fault-Tolerant Control of Large-Scale Systems: An Active Fault Diagnosis Approach
    Boem, Francesca
    Gallo, Alexander J.
    Raimondo, Davide M.
    Parisini, Thomas
    [J]. IEEE TRANSACTIONS ON CONTROL OF NETWORK SYSTEMS, 2020, 7 (01): : 288 - 301
  • [6] Fault-tolerant communication in large-scale manipulators
    Kochs, HD
    Geisselhardt, W
    Hilmer, H
    Lenord, M
    [J]. COMPUTER SAFETY, RELIABILITY AND SECURITY, 1998, 1516 : 254 - 266
  • [7] EFFICIENT CHECKPOINTING PROCEDURES FOR FAULT-TOLERANT DISTRIBUTED SYSTEMS
    SALEH, K
    AGARWAL, A
    [J]. MICROPROCESSING AND MICROPROGRAMMING, 1994, 40 (06): : 427 - 438
  • [8] A Fault-Tolerant Environment for Large-Scale Query Processing
    Kurt, Mehmet Can
    Agrawal, Gagan
    [J]. 2012 19TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING (HIPC), 2012,
  • [9] FAULT-TOLERANT CONTROL OF DYNAMIC LARGE-SCALE SYSTEMS
    VACHTSEVANOS, G
    KIM, YT
    CHRISTODOULOU, M
    [J]. PROCEEDINGS OF THE 1989 AMERICAN CONTROL CONFERENCE, VOLS 1-3, 1989, : 355 - 360
  • [10] FAULT-TOLERANT CONTROL AND DIAGNOSTICS FOR LARGE-SCALE SYSTEMS
    ERYUREK, E
    UPADHYAYA, BR
    [J]. IEEE CONTROL SYSTEMS MAGAZINE, 1995, 15 (05): : 34 - 42