Improving Fault Tolerance and Accuracy of a Distributed Reduction Algorithm

Cited by: 3
Authors
Niederbrucker, Gerhard [1]
Strakova, Hana [1]
Gansterer, Wilfried N. [1]
Affiliations
[1] Univ Vienna, Res Grp Theory & Applicat Algorithms, A-1010 Vienna, Austria
Funding
Austrian Science Fund
Keywords
PERFORMANCE
DOI
10.1109/SC.Companion.2012.89
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
Most existing algorithms for parallel or distributed reduction operations cannot handle temporary or permanent link and node failures. Only recently have methods been proposed that are, in principle, capable of tolerating link and node failures as well as soft errors such as bit flips or message loss. A particularly interesting example is the push-flow algorithm. On closer inspection, however, failure recovery in this method often carries severe performance drawbacks. Existing failure-handling mechanisms may cause a fall-back to an early stage of the computation, which slows convergence or even prevents it if failures occur too frequently. Moreover, state-of-the-art fault-tolerant distributed reduction algorithms may experience accuracy problems even in failure-free systems. We present the push-cancel-flow (PCF) algorithm, a novel algorithmic enhancement of the push-flow algorithm. We show that the PCF algorithm exhibits superior accuracy, performance, and fault tolerance compared with all other existing distributed reduction methods. Moreover, we employ the PCF algorithm in a fully distributed QR factorization process and illustrate that the improvements achieved at the reduction level translate directly to higher-level matrix operations such as the QR factorization considered here.
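To make the failure problem in the abstract concrete, the following is a minimal sketch of push-sum, a classic gossip-based averaging scheme. It is neither the push-flow nor the PCF algorithm, and it is not taken from the paper. The function name `push_sum`, its parameters, and the fully connected, synchronous network model are all assumptions chosen for illustration. The sketch shows why a plain distributed reduction is sensitive to message loss: every lost message removes part of the value and weight "mass" from the system, so the nodes may still agree on a result, but generally not on the true average.

```python
import random


def push_sum(values, rounds=100, loss_prob=0.0, seed=0):
    """Synchronous push-sum averaging on a fully connected network.

    Illustrative baseline only (assumed model, not the paper's method).
    Returns each node's estimate of the global average.
    """
    rng = random.Random(seed)
    n = len(values)
    s = [float(v) for v in values]  # value mass held by each node
    w = [1.0] * n                   # weight mass held by each node
    for _ in range(rounds):
        inbox_s = [0.0] * n
        inbox_w = [0.0] * n
        for i in range(n):
            # Each node keeps half of its mass and pushes the other half
            # to one randomly chosen other node.
            half_s, half_w = s[i] / 2.0, w[i] / 2.0
            inbox_s[i] += half_s
            inbox_w[i] += half_w
            j = rng.choice([k for k in range(n) if k != i])
            if rng.random() >= loss_prob:
                # Message delivered: the receiver absorbs the pushed half.
                inbox_s[j] += half_s
                inbox_w[j] += half_w
            # Otherwise the pushed half is lost and mass is no longer conserved.
        s, w = inbox_s, inbox_w
    return [si / wi for si, wi in zip(s, w)]


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7, 8]
    print("no loss :", push_sum(data))
    print("30% loss:", push_sum(data, loss_prob=0.3))
```

Without loss, all estimates should approach the true mean of the inputs. With lost messages the estimates typically still agree with one another but settle on a biased value, which is the kind of behavior that fault-tolerant schemes such as push-flow and PCF are designed to avoid.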
Pages: 643-651
Number of pages: 9