Private reliability environments for efficient fault-tolerance in CGRAs

Cited by: 2
Authors
Jafri, Syed M. A. H. [1]
Piestrak, Stanislaw J. [2]
Hemani, Ahmed [3]
Paul, Kolin [4]
Plosila, Juha [1]
Tenhunen, Hannu [3]
Affiliations
[1] Univ Turku, Turku, Finland
[2] Univ Lorraine, Inst Jean Lamour, CNRS, UMR 7198, Vandoeuvre Les Nancy, France
[3] Royal Inst Technol KTH, Stockholm, Sweden
[4] Indian Inst Technol, Delhi, India
Keywords
Fault-tolerance; Reliability; Adaptive systems; Energy-aware systems; Scrubbing; Reconfiguration; Coarse grained reconfigurable arrays; RECONFIGURABLE ARCHITECTURE; DESIGN
DOI
10.1007/s10617-014-9129-6
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
In the era of platforms hosting multiple applications with variable reliability needs, worst-case platform-wide fault-tolerance decisions are neither optimal nor desirable. As a solution to this problem, designs commonly employ adaptive fault-tolerance strategies that provide each application with the reliability level it actually needs. However, in the CGRA domain, the existing schemes either only allow shifting between different levels of modular redundancy (duplication, triplication, etc.) or protect only a particular region of a device (e.g. configuration memory, computation, or data memory). To complement these strategies, we propose private fault-tolerance environments which, in addition to modular redundancy, also provide low-cost sub-modular (e.g. residue mod 3) redundancy capable of handling both permanent and temporary faults in configuration memory, computation, communication, and data memory. We also present adaptive configuration scrubbing techniques which prevent fault accumulation in the configuration memory. Simulation results using a few selected algorithms (FFT, matrix multiplication, and FIR filter) show that the proposed approach is capable of providing flexible protection with energy overhead ranging from 3.125% to 107% for different reliability levels. Synthesis results have confirmed that the proposed architecture reduces the area overhead for self-checking (58%) and fault-tolerant (7.1%) versions, compared to the state-of-the-art adaptive reliability techniques.
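The abstract names residue mod 3 as an example of sub-modular redundancy. The following is a minimal sketch of the general residue-checking idea, not the paper's architecture: a small checker predicts the residue of a result from the operand residues and compares it with the residue of the result actually produced. The function names, the multiply-add operation, and the `fault_mask` bit-flip fault model are all hypothetical and chosen only for illustration; operands are assumed to be non-negative integers.

```python
# Illustrative sketch only: residue mod 3 checking of an arithmetic result.
# Names and the bit-flip fault model are hypothetical, not from the paper.

MODULUS = 3


def residue(value: int) -> int:
    """Return the residue of a non-negative integer modulo 3."""
    return value % MODULUS


def checked_multiply_add(a: int, b: int, c: int, fault_mask: int = 0):
    """Compute a*b + c and check it against a predicted residue.

    fault_mask XORs bits into the result to emulate a fault in the
    main datapath. Returns (result, error_detected).
    """
    # Main datapath, possibly corrupted by the injected fault.
    result = (a * b + c) ^ fault_mask
    # Checker path: works only on 2-bit residues of the operands.
    predicted = (residue(a) * residue(b) + residue(c)) % MODULUS
    error_detected = residue(result) != predicted
    return result, error_detected


if __name__ == "__main__":
    print(checked_multiply_add(7, 5, 2))                 # fault-free run
    print(checked_multiply_add(7, 5, 2, fault_mask=1))   # single bit flip
```

A single flipped bit changes the result by a power of two, which is never a multiple of 3, so this check flags every single-bit error in the result. Errors that shift the result by a multiple of 3 pass unnoticed, which is the price of checking with a 2-bit residue instead of a full duplicate.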
Pages: 295-327
Number of pages: 33