Extending the TOKENCMP cache coherence protocol for low overhead fault tolerance in CMP architectures

Cited by: 1
Authors
Fernandez-Pascual, Ricardo [1]
Garcia, Jose M. [1]
Acacio, Manuel E. [1]
Duato, Jose [2]
Affiliations
[1] Univ Murcia, Dept Ingn & Tecnol Computadores, Fac Informat, Murcia 30080, Spain
[2] Univ Politecn Valencia, Dept Informat Sistemas & Comp, Valencia 46022, Spain
Keywords
fault tolerance; cache coherence; CMP; transient failures; TOKENCMP
DOI
10.1109/TPDS.2007.70803
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
It is widely accepted that transient failures will appear more frequently in chips designed in the near future due to several factors such as the increased integration scale. On the other hand, chip multiprocessors (CMPs) that integrate several processor cores in a single chip are nowadays the best alternative to more efficient use of the increasing number of transistors that can be placed in a single die. Hence, it is necessary to design new techniques to deal with these faults to be able to build sufficiently reliable CMPs. In this work, we present a coherence protocol aimed at dealing with transient failures that affect the interconnection network of a CMP, thus assuming that the network is no longer reliable. In particular, our proposal extends a token-based cache coherence protocol so that no data can be lost and no deadlock can occur due to any dropped message. Using the GEMS full-system simulator, we compare our proposal against a similar protocol without fault tolerance (TOKENCMP). We show that in the absence of failures, our proposal does not introduce overhead in terms of increased execution time over TOKENCMP. Additionally, our protocol can tolerate message loss rates much higher than those likely to be found in the real world, without increasing the execution time by more than 15 percent.
Pages: 1044-1056
Number of pages: 13
Related papers
31 in total
  • [21] Performance and Fault Tolerance of Preconditioned Iterative Solvers on Low-Power ARM Architectures
    Aliaga, Jose I.
    Catalan, Sandra
    Chalios, Charalampos
    Nikolopoulos, Dimitrios S.
    Quintana-Orti, Enrique S.
    PARALLEL COMPUTING: ON THE ROAD TO EXASCALE, 2016, 27 : 711 - 720
  • [22] A fully distributed quorum consensus method with high fault-tolerance and low communication overhead
    Lin, XM
    THEORETICAL COMPUTER SCIENCE, 1997, 185 (02) : 259 - 275
  • [23] Low-overhead fault-tolerance for Java parallel applications on heterogeneous networked computers
    Ahn, J
    Kim, K
    Hwang, C
    INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED PROCESSING TECHNIQUES AND APPLICATIONS, VOLS I-V, PROCEEDINGS, 1999, : 1770 - 1775
  • [24] Hardware support flexible low overhead fault tolerance scheme in scalable shared-memory multiprocessors
    Liu, F
    Ge, JG
    IEEE-EMBS ASIA PACIFIC CONFERENCE ON BIOMEDICAL ENGINEERING - PROCEEDINGS, PTS 1 & 2, 2000, : 821 - 822
  • [25] FLOWER and FaME: A Low Overhead Bit-level Fault-map and Fault-tolerance Approach for Deeply Scaled Memories
    Kline, Donald, Jr.
    Zhang, Jiangwei
    Melhem, Rami
    Jones, Alex K.
    2020 IEEE INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE COMPUTER ARCHITECTURE (HPCA 2020), 2020, : 356 - 368
  • [26] Device View Redundancy: an adaptive low-overhead fault tolerance mechanism for many-core system
    Jia, Wentao
    Zhang, Chunyuan
    Fu, Jian
    2013 IEEE 15TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS & 2013 IEEE INTERNATIONAL CONFERENCE ON EMBEDDED AND UBIQUITOUS COMPUTING (HPCC_EUC), 2013, : 2080 - 2087
  • [27] A Low-Overhead Radiation Hardening Approach using Approximate Computing and Selective Fault Tolerance Techniques at the Software Level
    Aponte-Moreno, Alexander
    Restrepo-Calle, Felipe
    Pedraza, Cesar
    2019 19TH EUROPEAN CONFERENCE ON RADIATION AND ITS EFFECTS ON COMPONENTS AND SYSTEMS (RADECS), 2022, : 188 - 191
  • [28] COFTA: Hardware-software co-synthesis of heterogeneous distributed embedded systems for low overhead fault tolerance
    Dave, BP
    Jha, NK
    IEEE TRANSACTIONS ON COMPUTERS, 1999, 48 (04) : 417 - 441
  • [29] Low-cost fault-tolerance protocol for large-scale network monitoring
    Ahn, J
    Min, SG
    Choi, YI
    Lee, BS
    COMPUTATIONAL SCIENCE - ICCS 2003, PT III, PROCEEDINGS, 2003, 2659 : 504 - 513
  • [30] Understanding prediction-based partial redundant threading for low-overhead, high-coverage fault tolerance
    Reddy, Vimal K.
    Parthasarathy, Sailashri
    Rotenberg, Eric
    ACM SIGPLAN NOTICES, 2006, 41 (11) : 83 - 94