A distributed fault-tolerant asynchronous algorithm for performing N tasks

Cited by: 0
Authors
Weerasinghe, GM [1]
Lipsky, L [1]
Affiliations
[1] Univ Connecticut, Dept Comp Sci & Engn, Storrs, CT 06269 USA
Keywords
Networks of Workstations; message passing; performance evaluation; fault-tolerance; asynchronous; communication; dynamic load balancing
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This paper is a performance study of a fault-tolerant asynchronous algorithm for performing N independent and idempotent tasks on P processes. It is designed for the Single Program Multiple Data (SPMD) programming model and the fail-stop failure model without restarts. Our algorithm tolerates up to P - 1 process failures. That is, at least one process must survive for the lifetime of the application. The algorithm is structured in terms of a Symmetric Task Model in which each process is responsible for scheduling tasks dynamically and distributing progress information. A parameter called Periodicity controls how often progress information is distributed to the rest of the processes. A process can fail while distributing its progress information, causing inconsistencies between task partitions of different processes. Therefore, the major design goals are to optimize the scheduling phase so that, in the presence of failures and communication time-outs, the number of tasks redone is minimized, and to minimize the allocation of resources. In our study we avoid the use of checkpointing. Lost tasks are simply redone. Processes communicate only through asynchronous message passing. We present preliminary results of performance tests of this algorithm that we have implemented.
Pages: 69-73
Page count: 5