Improving Fault Tolerance and Accuracy of a Distributed Reduction Algorithm

Cited by: 3
Authors
Niederbrucker, Gerhard [1 ]
Strakova, Hana [1 ]
Gansterer, Wilfried N. [1 ]
Affiliations
[1] Univ Vienna, Res Grp Theory & Applicat Algorithms, A-1010 Vienna, Austria
Funding
Austrian Science Fund
Keywords
PERFORMANCE;
DOI
10.1109/SC.Companion.2012.89
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Most existing algorithms for parallel or distributed reduction operations cannot handle temporary or permanent link and node failures. Only recently have methods been proposed that are in principle capable of tolerating link and node failures as well as soft errors such as bit flips or message loss. A particularly interesting example is the push-flow algorithm. However, on closer inspection, it turns out that failure recovery in this method often implies severe performance drawbacks. Existing mechanisms for failure handling may lead to a fall-back to an early stage of the computation and consequently slow down convergence, or even prevent convergence if failures occur too frequently. Moreover, state-of-the-art fault-tolerant distributed reduction algorithms may experience accuracy problems even in failure-free systems. We present the push-cancel-flow (PCF) algorithm, a novel algorithmic enhancement of the push-flow algorithm. We show that the new push-cancel-flow algorithm exhibits superior accuracy, performance, and fault tolerance compared with all other existing distributed reduction methods. Moreover, we employ the novel PCF algorithm in the context of a fully distributed QR factorization process and illustrate that the improvements achieved at the reduction level directly translate to higher-level matrix operations, such as the considered QR factorization.
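The abstract does not spell out how push-flow works, so the following is only a minimal sketch of a flow-based gossip averaging scheme in that spirit, not the authors' implementation and not the PCF enhancement. It assumes the commonly described formulation in which each node stores one flow per neighbor, computes its local estimate as its initial data minus its outgoing flows, and mirrors a received flow with the opposite sign. The function name `push_flow_average`, its parameters, and the round-based simulation with random message loss are all hypothetical choices made for illustration.

```python
import random

def push_flow_average(values, neighbors, rounds=200, loss_prob=0.0, seed=0):
    """Simulate flow-based gossip averaging (illustrative sketch only).

    values[i]    : initial value held by node i
    neighbors[i] : list of nodes adjacent to node i
    loss_prob    : probability that a single message is dropped (assumption)
    """
    rng = random.Random(seed)
    n = len(values)
    # flow[i][j] = [value flow, weight flow] that node i believes went to j
    flow = [{j: [0.0, 0.0] for j in neighbors[i]} for i in range(n)]

    def local_mass(i):
        # Local mass = initial data minus everything recorded as flowed out.
        v = values[i] - sum(f[0] for f in flow[i].values())
        w = 1.0 - sum(f[1] for f in flow[i].values())
        return v, w

    for _ in range(rounds):
        for i in range(n):
            j = rng.choice(neighbors[i])
            v, w = local_mass(i)
            # Push half of the local mass into the flow towards j.
            flow[i][j][0] += v / 2
            flow[i][j][1] += w / 2
            if rng.random() < loss_prob:
                continue  # message lost: j keeps its stale mirrored flow
            # On receipt, j overwrites its flow with the negated sender flow,
            # which also recovers mass from earlier lost messages on this link.
            flow[j][i][0] = -flow[i][j][0]
            flow[j][i][1] = -flow[i][j][1]

    # Each node's estimate of the average is its value mass over weight mass.
    estimates = []
    for i in range(n):
        v, w = local_mass(i)
        estimates.append(v / w)
    return estimates

if __name__ == "__main__":
    vals = [float(i) for i in range(8)]
    ring = [[(i - 1) % 8, (i + 1) % 8] for i in range(8)]
    print(push_flow_average(vals, ring, rounds=500, loss_prob=0.2, seed=1))
```

Because the receiver overwrites its flow rather than adding to it, a dropped message only leaves the two endpoints temporarily inconsistent; the next message delivered on that link restores the balance. In this sketch the flows themselves keep changing even after the estimates have settled, and each estimate is computed as a difference involving those flows, which hints at the kind of accuracy issue the abstract refers to.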
Pages: 643-651
Number of pages: 9