A distributed fault-tolerant asynchronous algorithm for performing N tasks

Cited by: 0
Authors
Weerasinghe, GM [1]
Lipsky, L [1]
Affiliation
[1] Univ Connecticut, Dept Comp Sci & Engn, Storrs, CT 06269 USA
Keywords
Networks of Workstations; message passing; performance evaluation; fault-tolerance; asynchronous; communication; dynamic load balancing;
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This paper is a performance study of a fault-tolerant asynchronous algorithm for performing N independent and idempotent tasks on P processes. It is designed for the Single Program Multiple Data (SPMD) programming model and for a failure model of Fail-Stop failures without restarts. Our algorithm tolerates up to P - 1 process failures; that is, at least one process must survive for the lifetime of the application. The algorithm is structured in terms of a Symmetric Task Model in which each process is responsible for scheduling tasks dynamically and for distributing progress information. A parameter called Periodicity controls how often progress information is distributed to the rest of the processes. A process can fail while distributing its progress information, causing inconsistencies between the task partitions of different processes. The major design goals are therefore (1) to optimize the scheduling phase so that, in the presence of failures and communication time-outs, the number of tasks redone is minimized, and (2) to minimize the allocation of resources. In our study we avoid the use of checkpointing: lost tasks are simply redone. Processes communicate only through asynchronous message passing. We present preliminary results of performance tests of our implementation of this algorithm.
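The abstract gives no pseudocode, so the following is only a minimal illustrative sketch of the ideas it names, not the authors' algorithm. It assumes a round-based, single-threaded simulation in place of real asynchronous message passing, a hypothetical contiguous starting slice per rank, and a hypothetical `fail_at` map that says in which round a rank stops. The names `Process`, `next_task`, and `simulate` are invented for illustration. Each process schedules its own tasks, broadcasts its progress after every `periodicity` completed tasks, and any progress not yet broadcast by a failed process is lost and redone by the survivors.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    rank: int
    known_done: set = field(default_factory=set)  # tasks this process believes are finished
    unsent: set = field(default_factory=set)      # finished locally, not yet broadcast
    alive: bool = True


def next_task(proc, n_tasks, n_procs):
    """Return the first task not known to be done, scanning from this rank's slice."""
    start = proc.rank * n_tasks // n_procs  # assumed initial partition, not from the paper
    for offset in range(n_tasks):
        task = (start + offset) % n_tasks
        if task not in proc.known_done:
            return task
    return None


def simulate(n_tasks, n_procs, periodicity, fail_at=None):
    """Run rounds until no surviving process has work left.

    fail_at maps rank -> round number in which that rank fail-stops.
    Returns (list of surviving processes, total number of task executions).
    """
    fail_at = fail_at or {}
    procs = [Process(r) for r in range(n_procs)]
    executions = 0
    round_no = 0
    progressed = True
    while progressed:
        progressed = False
        for p in procs:
            if not p.alive:
                continue
            if fail_at.get(p.rank) == round_no:
                p.alive = False  # fail-stop: anything in p.unsent is lost
                continue
            task = next_task(p, n_tasks, n_procs)
            if task is None:
                continue
            executions += 1  # tasks are idempotent, so a repeat is wasteful but harmless
            p.known_done.add(task)
            p.unsent.add(task)
            progressed = True
            if len(p.unsent) >= periodicity:
                # Distribute progress information to every surviving peer.
                for q in procs:
                    if q.alive and q is not p:
                        q.known_done |= p.unsent
                p.unsent.clear()
        round_no += 1
    return [p for p in procs if p.alive], executions


if __name__ == "__main__":
    survivors, runs = simulate(n_tasks=12, n_procs=3, periodicity=2, fail_at={1: 3})
    print(len(survivors), "survivors,", runs, "task executions for 12 tasks")
```

In this toy model a smaller `periodicity` means more messages but fewer redone tasks, and a larger one means the reverse. This is the kind of trade-off the Periodicity parameter appears to control, although the sketch says nothing about the paper's measured results.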
Pages: 69-73
Page count: 5