A distributed fault-tolerant asynchronous algorithm for performing N tasks

Times cited: 0

Authors:

Weerasinghe, GM [1]

Lipsky, L [1]

Affiliation:

[1] Univ Connecticut, Dept Comp Sci & Engn, Storrs, CT 06269 USA

Source:

COMPUTERS AND THEIR APPLICATIONS | 2001

Keywords:

Networks of Workstations; message passing; performance evaluation; fault-tolerance; asynchronous; communication; dynamic load balancing

DOI:

Not available

CLC classification:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

This paper is a performance study of a fault-tolerant asynchronous algorithm for performing N independent, idempotent tasks on P processes. The algorithm is designed for the Single Program Multiple Data (SPMD) programming model and the fail-stop failure model without restarts. It tolerates up to P - 1 process failures; that is, at least one process must survive for the lifetime of the application. The algorithm is structured according to a Symmetric Task Model, in which each process is responsible both for scheduling tasks dynamically and for distributing progress information. A parameter called Periodicity controls how often a process distributes its progress information to the other processes. Because a process can fail while distributing its progress information, the task partitions held by different processes can become inconsistent. The major design goals are therefore (1) to optimize the scheduling phase so that, in the presence of failures and communication time-outs, the number of tasks that must be redone is minimized, and (2) to minimize resource allocation. In our study we avoid checkpointing; lost tasks are simply redone. Processes communicate only through asynchronous message passing. We present preliminary results of performance tests of our implementation of this algorithm.
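The abstract describes the scheme only at a high level, so the following is a minimal, deterministic toy simulation of the ideas it names, not the authors' implementation. Assumptions (all hypothetical): processes take turns in lock-step rounds; an asynchronous message is modeled as an immediate merge of the sender's set of finished tasks into each receiver's set; each process scans for unfinished tasks starting from an offset of pid·N/P; `periodicity` plays the role of the Periodicity parameter (broadcast after that many unreported tasks); and `fail_plan` encodes a fail-stop failure that can strike partway through a broadcast. The names `Process`, `next_task`, `broadcast`, `simulate`, and `fail_plan` are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    pid: int
    done: set[int] = field(default_factory=set)  # local view of finished tasks
    unreported: int = 0  # tasks finished since this process last broadcast
    alive: bool = True


def next_task(p: Process, n_tasks: int, n_procs: int):
    """Return the first task not known to be done, scanning cyclically from a
    per-process offset (hypothetical heuristic to spread processes apart)."""
    start = p.pid * n_tasks // n_procs
    for i in range(n_tasks):
        t = (start + i) % n_tasks
        if t not in p.done:
            return t
    return None


def broadcast(p: Process, receivers: list[Process]) -> None:
    """Model an asynchronous send as an immediate merge into each receiver's view."""
    for q in receivers:
        q.done |= p.done
    p.unreported = 0


def simulate(n_tasks: int, n_procs: int, periodicity: int, fail_plan=None):
    """Run rounds until a surviving process knows every task is done.

    fail_plan (hypothetical) maps pid -> (round, k): in that round the process
    finishes its task, delivers its progress to only its first k live peers,
    then fail-stops with no restart. Returns (executions, redone executions).
    """
    fail_plan = fail_plan or {}
    procs = [Process(pid) for pid in range(n_procs)]
    executions = 0
    rnd = 0
    while True:
        live = [p for p in procs if p.alive]
        if not live:
            raise RuntimeError("all P processes failed")
        if any(len(p.done) == n_tasks for p in live):
            return executions, executions - n_tasks
        for p in live:
            t = next_task(p, n_tasks, n_procs)
            if t is not None:
                executions += 1  # tasks are idempotent, so re-running one is safe
                p.done.add(t)
                p.unreported += 1
            peers = [q for q in procs if q.alive and q is not p]
            plan = fail_plan.get(p.pid)
            if plan is not None and plan[0] == rnd:
                broadcast(p, peers[: plan[1]])  # failure mid-distribution
                p.alive = False  # fail-stop: never restarts
            elif p.unreported >= periodicity or len(p.done) == n_tasks:
                broadcast(p, peers)
        rnd += 1


print(simulate(8, 2, 2))                         # (8, 0)
print(simulate(8, 2, 2, fail_plan={1: (0, 0)}))  # (9, 1)
```

In the second call, process 1 fails in round 0 before reporting the task it finished, so the lone survivor redoes that task. In this toy model, with `fail_plan={1: (1, 0)}`, a `periodicity` of 1 yields (9, 1) while a `periodicity` of 4 yields (10, 2): distributing progress more often shrinks the unreported work a silent failure can lose, at the cost of more messages.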


Pages: 69-73

Number of pages: 5
