Lifetime Reliability-Aware Checkpointing Mechanism: Modelling and Analysis

Cited by: 6
Authors
bin Bandan, Mohamad Imran [1]
Bhattacharjee, Subhasis [1]
Shafik, Rishad A. [1]
Pradhan, Dhiraj K. [1]
Mathew, Jimson [1]
Affiliations
[1] Univ Bristol, Bristol BS8 1TH, Avon, England
Keywords
Checkpointing; fault tolerance; microprocessors; lifetime reliability
DOI
10.1109/ISED.2013.32
Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification code
0808; 0809
Abstract
Checkpointing mechanisms tolerate transient faults by rolling the system back to a previously saved state. In this paper, we propose a novel checkpointing mechanism that provides fault tolerance in a duplex system in the presence of both transient and permanent faults. The main objective of the proposed mechanism is to extend the lifetime reliability of the duplex system by avoiding, or even tolerating, permanent faults in microprocessors. In addition, we propose migrating tasks from a 'near-to-die' processor to a spare processor when the current Mean-Time-To-Failure (MTTF) value is less than or equal to a pre-determined threshold MTTF value. We validate the proposed mechanism and perform overhead analysis using various case studies. We then compare it with one of the most popular existing checkpointing mechanisms, namely the roll-forward checkpointing scheme [9]. We show that, unlike roll-back or roll-forward mechanisms, the proposed mechanism gives significantly higher lifetime reliability with reasonable system overheads.
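The migration condition stated in the abstract (migrate when the current MTTF is less than or equal to a threshold MTTF) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Processor` class, the `should_migrate` and `select_host` functions, and the use of hours as the MTTF unit are all hypothetical names and assumptions introduced only for illustration, and how the paper actually estimates MTTF is not described in this record.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    """Hypothetical processor record; not taken from the paper."""
    name: str
    mttf_hours: float  # assumed current MTTF estimate, in hours


def should_migrate(current_mttf: float, threshold_mttf: float) -> bool:
    """Return True when the current MTTF is less than or equal to the threshold."""
    return current_mttf <= threshold_mttf


def select_host(active: Processor, spare: Processor, threshold_mttf: float) -> Processor:
    """Pick the processor that should run the task next.

    The task moves to the spare only when the active processor is
    'near-to-die', i.e. its MTTF has dropped to the threshold or below.
    """
    if should_migrate(active.mttf_hours, threshold_mttf):
        return spare
    return active


# Illustrative values only; these are not results from the paper.
worn = Processor("P0", mttf_hours=900.0)
spare = Processor("P_spare", mttf_hours=50_000.0)
host = select_host(worn, spare, threshold_mttf=1_000.0)
```

With these illustrative values, `worn` is at or below the threshold, so `select_host` returns the spare processor. The comparison is inclusive, matching the "less than or equal to" wording of the abstract.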
Pages: 128-132
Number of pages: 5
Related papers
50 in total
  • [21] Reliability-Aware Resource Allocation in HPC Systems
    Gottumukkala, Narasimha Raju
    Leangsuksun, Chokchai Box
    Taerat, Narate
    Nassar, Raja
    Scott, Stephen L.
    2007 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING, 2007, : 312 - +
  • [22] Reliability-Aware Dynamic Voltage and Frequency Scaling
    Firouzi, F.
    Salehi, M. E.
    Wang, F.
    Fakhraie, S. M.
    Safari, S.
    IEEE ANNUAL SYMPOSIUM ON VLSI (ISVLSI 2010), 2010, : 304 - 309
  • [23] BRAVO: Balanced Reliability-Aware Voltage Optimization
    Swaminathan, Karthik
    Chandramoorthy, Nandhini
    Cher, Chen-Yong
    Bertran, Ramon
    Buyuktosunoglu, Alper
    Bose, Pradip
    2017 23RD IEEE INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE COMPUTER ARCHITECTURE (HPCA), 2017, : 97 - 108
  • [24] On Reliability-Aware Server Consolidation in Cloud Datacenters
    Varasteh, Amir
    Tashtarian, Farzad
    Goudarzi, Maziar
    2017 16TH INTERNATIONAL SYMPOSIUM ON PARALLEL AND DISTRIBUTED COMPUTING (ISPDC-2017), 2017, : 95 - 101
  • [25] Reliability-Aware Scheduling on Heterogeneous Multicore Processors
    Naithani, Ajeya
    Eyerman, Stijn
    Eeckhout, Lieven
    2017 23RD IEEE INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE COMPUTER ARCHITECTURE (HPCA), 2017, : 397 - 408
  • [26] Reliability-aware Virtual Data Center Embedding
    Zuo, Cheng
    Yu, Hongfang
    Anand, Vishal
    2014 6TH INTERNATIONAL WORKSHOP ON RELIABLE NETWORKS DESIGN AND MODELING (RNDM), 2014, : 151 - 157
  • [27] Reliability-aware core partitioning in chip multiprocessors
    Oz, Isil
    Topcuoglu, Haluk Rahmi
    Kandemir, Mahmut
    Tosun, Oguz
    JOURNAL OF SYSTEMS ARCHITECTURE, 2012, 58 (3-4) : 160 - 176
  • [28] Reliability-Aware Distributed Computing Scheduling Policy
    Abawajy, Jemal
    Hassan, Mohammad Mehedi
    ALGORITHMS AND ARCHITECTURES FOR PARALLEL PROCESSING, ICA3PP 2015, 2015, 9532 : 627 - 632
  • [29] A reliability-aware LDPC code decoding algorithm
    Alles, Matthias
    Brack, Torben
    Welm, Norbert
    2007 IEEE 65TH VEHICULAR TECHNOLOGY CONFERENCE, VOLS 1-6, 2007, : 1544 - 1548
  • [30] Joint Latency and Reliability-Aware Controller Placement
    Rasol, Kurdman Abdulrahman Rasol
    Domingo-Pascual, Jordi
    35TH INTERNATIONAL CONFERENCE ON INFORMATION NETWORKING (ICOIN 2021), 2021, : 197 - 202