An architecture for tolerating processor failures in shared-memory multiprocessors

Cited by: 11
Authors
Banatre, M
Gefflaut, A
Joubert, P
Morin, C
Lee, PA
Affiliations
[1] UNIV NEWCASTLE UPON TYNE,DEPT COMP SCI,NEWCASTLE TYNE NE1 7RU,TYNE & WEAR,ENGLAND
[2] SIEMENS AG,OEN SN EBD 13,D-81359 MUNICH,GERMANY
Keywords
shared memory multiprocessor; fault tolerance; stable storage; backward error recovery; simulation; performance
DOI
10.1109/12.543705
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
This paper focuses on the problem of fault tolerance in shared-memory multiprocessors and describes an architecture designed for transparently tolerating processor failures. The Recoverable Shared Memory (RSM) is the novel component of this architecture, providing a hardware-supported backward error recovery mechanism that minimizes the propagation of recovery when a processor fails. The RSM permits a shared-memory multiprocessor to be constructed using standard caches and cache coherence protocols, and does not require any changes to application software. The performance of the recovery scheme supported by the RSM is evaluated and compared with other schemes that have been proposed for fault-tolerant shared-memory multiprocessors. The performance study was conducted by simulation using address traces collected from real parallel applications.
Pages: 1101-1115
Number of pages: 15