Improving Fault Tolerance and Accuracy of a Distributed Reduction Algorithm

Cited by: 3
Authors
Niederbrucker, Gerhard [1]
Strakova, Hana [1]
Gansterer, Wilfried N. [1]
Affiliation
[1] Univ Vienna, Res Grp Theory & Applicat Algorithms, A-1010 Vienna, Austria
Funding
Austrian Science Fund
Keywords
PERFORMANCE
DOI
10.1109/SC.Companion.2012.89
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Most existing algorithms for parallel or distributed reduction operations cannot handle temporary or permanent link and node failures. Only recently have methods been proposed that are, in principle, capable of tolerating link and node failures as well as soft errors such as bit flips or message loss. A particularly interesting example is the push-flow algorithm. However, on closer inspection, it turns out that failure recovery in this method often implies severe performance drawbacks. Existing mechanisms for failure handling may effectively cause a fallback to an early stage of the computation, which slows down convergence or even prevents it if failures occur too frequently. Moreover, state-of-the-art fault-tolerant distributed reduction algorithms may experience accuracy problems even in failure-free systems. We present the push-cancel-flow (PCF) algorithm, a novel algorithmic enhancement of the push-flow algorithm. We show that the new push-cancel-flow algorithm exhibits superior accuracy, performance, and fault tolerance compared with all other existing distributed reduction methods. Moreover, we employ the novel PCF algorithm in the context of a fully distributed QR factorization process and illustrate that the improvements achieved at the reduction level directly translate to higher-level matrix operations, such as the considered QR factorization.
Pages: 643-651
Number of pages: 9