Improving Fault Tolerance and Accuracy of a Distributed Reduction Algorithm

Cited by: 3
Authors
Niederbrucker, Gerhard [1 ]
Strakova, Hana [1 ]
Gansterer, Wilfried N. [1 ]
Affiliations
[1] Univ Vienna, Res Grp Theory & Applicat Algorithms, A-1010 Vienna, Austria
Funding
Austrian Science Fund;
Keywords
PERFORMANCE;
DOI
10.1109/SC.Companion.2012.89
Chinese Library Classification
TP3 [Computing Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Most existing algorithms for parallel or distributed reduction operations are not able to handle temporary or permanent link and node failures. Only recently, methods were proposed which are in principle capable of tolerating link and node failures as well as soft errors like bit flips or message loss. A particularly interesting example is the push-flow algorithm. However, on closer inspection, it turns out that in this method failure recovery often implies severe performance drawbacks. Existing mechanisms for failure handling may lead to a fall-back to an early stage of the computation and consequently slow down convergence, or even prevent convergence if failures occur too frequently. Moreover, state-of-the-art fault-tolerant distributed reduction algorithms may experience accuracy problems even in failure-free systems. We present the push-cancel-flow (PCF) algorithm, a novel algorithmic enhancement of the push-flow algorithm. We show that the new push-cancel-flow algorithm exhibits superior accuracy, performance and fault tolerance over all other existing distributed reduction methods. Moreover, we employ the novel PCF algorithm in the context of a fully distributed QR factorization process and illustrate that the improvements achieved at the reduction level directly translate to higher-level matrix operations, such as the considered QR factorization.
Pages: 643-651
Page count: 9