Extending the TOKENCMP cache coherence protocol for low overhead fault tolerance in CMP architectures

Cited by: 1
Authors
Fernandez-Pascual, Ricardo [1]
Garcia, Jose M. [1]
Acacio, Manuel E. [1]
Duato, Jose [2]
Affiliations
[1] Univ Murcia, Dept Ingn & Tecnol Computadores, Fac Informat, Murcia 30080, Spain
[2] Univ Politecn Valencia, Dept Informat Sistemas & Comp, Valencia 46022, Spain
Keywords
fault tolerance; cache coherence; CMP; transient failures; TOKENCMP
DOI
10.1109/TPDS.2007.70803
Chinese Library Classification (CLC) number
TP301 [Theory and methods]
Discipline classification code
081202
Abstract
It is widely accepted that transient failures will appear more frequently in chips designed in the near future due to several factors such as the increased integration scale. At the same time, chip multiprocessors (CMPs), which integrate several processor cores in a single chip, are currently the best way to make efficient use of the increasing number of transistors that can be placed on a single die. Hence, new techniques are needed to deal with these faults in order to build sufficiently reliable CMPs. In this work, we present a coherence protocol aimed at dealing with transient failures that affect the interconnection network of a CMP, thus assuming that the network is no longer reliable. In particular, our proposal extends a token-based cache coherence protocol so that no data can be lost and no deadlock can occur due to any dropped message. Using the GEMS full-system simulator, we compare our proposal against a similar protocol without fault tolerance (TOKENCMP). We show that in the absence of failures, our proposal does not increase execution time over TOKENCMP. Additionally, our protocol can tolerate message loss rates much higher than those likely to be found in the real world without increasing the execution time by more than 15 percent.
Pages: 1044-1056
Number of pages: 13