Checkpointing Algorithms for Fault-Tolerant Execution of Large-Scale Distributed Applications in Cloud

Cited by: 3
Authors
Kumari, Priti [1]
Kaur, Parmeet [1]
Affiliations
[1] Jaypee Inst Informat Technol, Dept CSE IT, Noida, India
Keywords
Cloud computing; Fault tolerance; Checkpointing; Message logging; Rollback recovery; BoT application; Distributed application; MODEL
DOI
10.1007/s11277-020-07949-0
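Chinese Library Classification (CLC) number
TN [Electronic technology, communication technology]
Subject classification code
0809
Abstract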
Cloud computing provides virtually unlimited resources and a suitable environment for executing large-scale computing applications. However, it is also susceptible to frequent failures, which can adversely affect both users and service providers. Fault tolerance techniques are therefore necessary for the reliable execution of applications in the cloud. This work presents checkpointing-based fault tolerance protocols for two types of distributed applications. The first type is Bag-of-Tasks (BoT) applications, in which an application comprises a set of independent tasks that do not communicate with each other during execution. Because the tasks are independent, an uncoordinated checkpointing algorithm is proposed for fault tolerance of BoT applications. The second type is large-scale distributed applications composed of multiple tasks that depend on each other through inter-task message passing. An uncoordinated checkpointing and message logging protocol is presented for this type of application. The proposed protocols use storage at the edge switches of a data center to reduce the bandwidth consumed by saving checkpoints and message logs. Simulation results show that the proposed protocols achieve a higher rate of successful recovery from failures and incur lower resource overhead than other contemporary and related schemes.
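
The general idea of uncoordinated checkpointing for independent tasks can be illustrated with a minimal Python sketch. This is not the authors' protocol, which the abstract does not specify in detail. The names `CheckpointStore` and `run_task`, the in-memory store standing in for storage at an edge switch, and the toy summation workload are all assumptions made for illustration.

```python
import pickle


class CheckpointStore:
    """Hypothetical in-memory stand-in for checkpoint storage
    (for example, storage attached to an edge switch)."""

    def __init__(self):
        self._data = {}

    def save(self, task_id, state):
        # Serialize so the stored checkpoint is a snapshot, not a live reference.
        self._data[task_id] = pickle.dumps(state)

    def load(self, task_id):
        blob = self._data.get(task_id)
        return pickle.loads(blob) if blob is not None else None


def run_task(task_id, n_steps, store, interval, fail_at=None):
    """Toy task: sum the integers 0..n_steps-1.

    The task checkpoints its own state every `interval` steps without
    coordinating with any other task. `fail_at` (an assumption for the
    demo) simulates a failure just before that step executes.
    """
    # Resume from this task's latest checkpoint, or start fresh.
    state = store.load(task_id) or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError(f"simulated failure in task {task_id}")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            store.save(task_id, state)
    return state["total"]


if __name__ == "__main__":
    store = CheckpointStore()
    try:
        run_task("t1", 10, store, interval=3, fail_at=7)
    except RuntimeError:
        pass  # the task failed; only work since its last checkpoint is lost
    # Restart: the task rolls back to its own checkpoint and finishes.
    print(run_task("t1", 10, store, interval=3))
```

Because the tasks in a BoT application exchange no messages, each task can roll back to its own latest checkpoint without forcing any other task to roll back, which is why no coordination is needed in this setting.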
Pages: 1853-1877
Page count: 25