Understanding prediction-based partial redundant threading for low-overhead, high-coverage fault tolerance

Cited by: 10
Authors
Reddy, Vimal K. [1]
Parthasarathy, Sailashri [2]
Rotenberg, Eric [1]
Affiliations
[1] N Carolina State Univ, Dept Elect & Comp Engn, Raleigh, NC 27695 USA
[2] Intel Corp, Architecture Modeling Infrastruct Grp, Hudson, MA 01749 USA
Keywords
design; performance; reliability; simultaneous multithreading (SMT); chip multiprocessor (CMP); slipstream processor; transient faults; time redundancy; redundant multithreading; branch prediction; value prediction
DOI
10.1145/1168918.1168869
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Redundant threading architectures duplicate all instructions to detect and possibly recover from transient faults. Several lighter-weight Partial Redundant Threading (PRT) architectures have been proposed recently. (i) Opportunistic Fault Tolerance duplicates instructions only during periods of poor single-thread performance. (ii) ReStore does not explicitly duplicate instructions and instead exploits mispredictions among highly confident branch predictions as symptoms of faults. (iii) Slipstream creates a reduced alternate thread by replacing many instructions with highly confident predictions. We explore PRT as a possible direction for achieving the fault tolerance of full duplication with the performance of single-thread execution. Opportunistic and ReStore yield partial coverage since they are restricted to using only partial duplication or only confident predictions, respectively. Previous analysis of Slipstream fault tolerance was cursory and concluded that only duplicated instructions are covered. In this paper, we attempt to better understand Slipstream's fault tolerance, conjecturing that the mixture of partial duplication and confident predictions actually closely approximates the coverage of full duplication. A thorough dissection of prediction scenarios confirms that faults in nearly 100% of instructions are detectable. Fewer than 0.1% of faulty instructions are not detectable due to coincident faults and mispredictions. Next, we show that the current recovery implementation fails to leverage this excellent detection capability, since recovery sometimes initiates belatedly, after a detected faulty instruction has already retired. We propose and evaluate a suite of simple microarchitectural alterations to recovery and checking. Using the best alterations, Slipstream can recover from faults in 99% of instructions, compared to only 78% of instructions without alterations. Both results are much higher than predicted by past research, which claims coverage for only duplicated instructions, or 65% of instructions. On an 8-issue SMT processor, Slipstream performs within 1.3% of single-thread execution, whereas full duplication slows performance by 14%. A key byproduct of this paper is a novel analysis framework in which every dynamic instruction is considered to be hypothetically faulty, thus not requiring explicit fault injection. Fault coverage is measured in terms of the fraction of candidate faulty instructions that are directly or indirectly detectable before retirement. This framework provides a reliable means to compare coverage of different PRT approaches, avoiding pitfalls of incomplete fault injection experiments. Moreover, one simulation does the work of very many fault injection experiments.
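The coverage metric described in the abstract can be illustrated with a minimal sketch. It is not the paper's simulator: the `DynInst` record, its three fields, and the `coverage` function are hypothetical names introduced here, and the rule that an instruction counts as detectable when it is either duplicated or checked by a confident prediction is a simplifying assumption. The sketch treats every dynamic instruction in a trace as a candidate faulty instruction and reports the fraction that is detectable and the fraction that is detectable before retirement.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class DynInst:
    """One dynamic instruction in a hypothetical trace (illustrative fields only)."""
    duplicated: bool             # assumed: executed redundantly, so compared directly
    checked_by_prediction: bool  # assumed: a confident prediction would expose a bad result
    retired_before_check: bool   # assumed: detection arrives after the instruction retired


def coverage(trace: Iterable[DynInst]) -> Tuple[float, float]:
    """Return (detection coverage, recovery coverage) as fractions of all instructions.

    Every instruction is treated as hypothetically faulty, so no fault is injected.
    """
    insts = list(trace)
    if not insts:
        raise ValueError("empty trace")
    # Detectable: covered directly (duplication) or indirectly (prediction check).
    detectable = [i for i in insts if i.duplicated or i.checked_by_prediction]
    # Recoverable: detectable and caught before the instruction retires.
    recoverable = [i for i in detectable if not i.retired_before_check]
    n = len(insts)
    return len(detectable) / n, len(recoverable) / n


if __name__ == "__main__":
    demo = [
        DynInst(True, False, False),   # duplicated, caught in time
        DynInst(False, True, False),   # prediction-checked, caught in time
        DynInst(False, True, True),    # detected, but only after retirement
        DynInst(False, False, False),  # not covered at all
    ]
    print(coverage(demo))
```

Because a single pass over the trace classifies every instruction, one run of such an analysis stands in for many separate fault injection experiments, which is the property the abstract highlights.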
Pages: 83-94
Number of pages: 12
Related papers
7 in total
  • [1] Buffer-Based High-Coverage and Low-Overhead Request Event Monitoring in the Cloud
    Gao, Kaihui
    Sun, Chen
    Wang, Shuai
    Li, Dan
    Zhou, Yu
    Liu, Hongqiang Harry
    Zhu, Lingjun
    Zhang, Ming
    Deng, Xiang
    Zhou, Cheng
    Lu, Lu
    IEEE-ACM TRANSACTIONS ON NETWORKING, 2023, 31 (04) : 1732 - 1747
  • [2] An Algorithmic Approach to Error Localization and Partial Recomputation for Low-Overhead Fault Tolerance
    Sloan, Joseph
    Kumar, Rakesh
    Bronevetsky, Greg
    2013 43RD ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS (DSN), 2013,
  • [3] Low-overhead fault tolerance for high-throughput data processing systems
    Martin, Andre
    Knauth, Thomas
    Creutz, Stephan
    Becker, Diogo
    Weigert, Stefan
    Fetzer, Christof
    Brito, Andrey
    31ST INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS (ICDCS 2011), 2011, : 689 - 699
  • [4] High-coverage fault tolerance in real-time systems based on point-to-point communication
    Kim, KH
    Subbaraman, C
    Shokri, E
    1997 HIGH-ASSURANCE ENGINEERING WORKSHOP - PROCEEDINGS, 1997, : 141 - 148
  • [5] Low overhead partial enhanced scan technique for compact and high fault coverage transition delay test patterns
    Wang, Seongmoon
    Wei, Wenlong
    PROCEEDINGS OF THE 13TH IEEE EUROPEAN TEST SYMPOSIUM: ETS 2008, 2008, : 125 - 130
  • [6] Low-Overhead and High-Precision Prediction Model for Content-Based Sensor Search in the Internet of Things
    Zhang, Puning
    Liu, Yuanan
    Wu, Fan
    Liu, Suyan
    Tang, Bihua
    IEEE COMMUNICATIONS LETTERS, 2016, 20 (04) : 720 - 723
  • [7] Hybrid delay scan: A low hardware overhead scan-based delay test technique for high fault coverage and compact test sets
    Wang, S
    Liu, X
    Chakradhar, ST
    DESIGN, AUTOMATION AND TEST IN EUROPE CONFERENCE AND EXHIBITION, VOLS 1 AND 2, PROCEEDINGS, 2004, : 1296 - 1301