Experiences with software-based soft-error mitigation using AN codes

Authors
Hoffmann, Martin [1]
Ulbrich, Peter [1]
Dietrich, Christian [1]
Schirmeier, Horst [2]
Lohmann, Daniel [1]
Schroeder-Preikschat, Wolfgang [1]
Affiliations
[1] Univ Erlangen Nurnberg, Chair Distributed Syst & Operating Syst, D-91058 Erlangen, Germany
[2] Tech Univ Dortmund, Dept Comp Sci 12, D-44221 Dortmund, Germany
Keywords
Fault injection; Arithmetic code; Dependability; FAULT;
DOI
10.1007/s11219-014-9260-4
Chinese Library Classification number
TP31 [Computer software];
Subject classification code
081202; 0835;
Abstract
Arithmetic error coding schemes are a well-known and effective technique for soft-error mitigation. Although the underlying coding theory is generally a complex area of mathematics, its practical implementation is comparatively simple. However, compliance with the theory can easily be lost while moving toward an actual implementation, which ultimately jeopardizes the intended fault-tolerance characteristics and effectiveness. In this paper, we present our experiences and lessons learned from implementing arithmetic error coding schemes (AN codes) in the context of our Combined Redundancy fault-tolerance approach. We focus on the challenges and pitfalls in the transition from maths to machine code for a binary computer from a systems perspective. Our results show that practical misconceptions (such as the use of prime numbers) and architecture-dependent implementation glitches occur at every stage of this transition. We identify typical pitfalls and describe practical measures to find and resolve them. This allowed us to eliminate all remaining silent data corruptions in the Combined Redundancy framework, which we validated by an extensive fault-injection campaign covering the entire fault space of 1-bit and 2-bit errors.
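For readers unfamiliar with AN codes, the basic idea can be sketched as follows: a value n is stored as the product A * n, and any word that is not a multiple of A is treated as corrupted. The snippet below is only a minimal illustration under stated assumptions, not the Combined Redundancy implementation: the constant A = 58659, the 32-bit word size, and all function names are chosen for this sketch and do not come from the abstract.

```python
# Minimal AN-code sketch. All names and constants here are illustrative
# assumptions, not the implementation described in the paper.

A = 58659        # assumed odd encoding constant, chosen for this sketch only
WORD_BITS = 32   # assumed machine word size


def encode(n: int) -> int:
    """Return the AN-encoded form of n, i.e. the multiple A * n."""
    code = A * n
    if not 0 <= code < (1 << WORD_BITS):
        raise OverflowError("encoded value does not fit into the word")
    return code


def is_valid(code: int) -> bool:
    """A code word is valid only if it is a multiple of A."""
    return code % A == 0


def decode(code: int) -> int:
    """Check the code word and return the original value."""
    if not is_valid(code):
        raise ValueError("corrupted code word detected")
    return code // A


def add(x: int, y: int) -> int:
    """Add two encoded values: A*a + A*b = A*(a + b) stays a valid code word.

    Overflow of the sum beyond the word size is not handled in this sketch.
    """
    return x + y


def undetected_single_bit_flips(n: int) -> list:
    """Return the bit positions whose flip in encode(n) would go unnoticed."""
    code = encode(n)
    return [bit for bit in range(WORD_BITS) if is_valid(code ^ (1 << bit))]
```

A single bit flip changes the code word by plus or minus a power of two, and no power of two is a multiple of an odd A greater than 1, so `undetected_single_bit_flips` returns an empty list for any value that fits the word. Detection of 2-bit errors, by contrast, depends on the specific choice of A and on how the arithmetic is carried out on the target machine, which is the kind of pitfall the abstract refers to.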
Pages: 87-113
Page count: 27