Improving Fault Tolerance and Accuracy of a Distributed Reduction Algorithm

Cited by: 3

Authors:

Niederbrucker, Gerhard [1]

Strakova, Hana [1]

Gansterer, Wilfried N. [1]

Affiliations:

[1] Univ Vienna, Res Grp Theory & Applicat Algorithms, A-1010 Vienna, Austria

Source:

2012 SC COMPANION: HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SCC) | 2012

Funding:

Austrian Science Fund;

Keywords:

PERFORMANCE;

DOI:

10.1109/SC.Companion.2012.89

Chinese Library Classification number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

Most existing algorithms for parallel or distributed reduction operations are not able to handle temporary or permanent link and node failures. Only recently have methods been proposed that are in principle capable of tolerating link and node failures as well as soft errors like bit flips or message loss. A particularly interesting example is the push-flow algorithm. However, on closer inspection, it turns out that in this method the failure recovery often implies severe performance drawbacks. Existing mechanisms for failure handling may lead to a fall-back to an early stage of the computation and consequently slow down convergence or even prevent convergence if failures occur too frequently. Moreover, state-of-the-art fault-tolerant distributed reduction algorithms may experience accuracy problems even in failure-free systems. We present the push-cancel-flow (PCF) algorithm, a novel algorithmic enhancement of the push-flow algorithm. We show that the new push-cancel-flow algorithm exhibits superior accuracy, performance and fault tolerance over all other existing distributed reduction methods. Moreover, we employ the novel PCF algorithm in the context of a fully distributed QR factorization process and illustrate that the improvements achieved at the reduction level directly translate to higher-level matrix operations, such as the considered QR factorization.
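The abstract names the push-flow algorithm but does not state its update rules. As a rough illustration of the general idea behind flow-based gossip reduction, the sketch below simulates distributed averaging in which each node stores, per neighbor, the total amount it believes it has sent over that edge, and every message carries that whole total rather than an increment. Everything here is an assumption made for illustration: the function name `push_flow_average`, the `loss_prob` parameter, the sequential round structure, and the symmetric neighbor lists are hypothetical, and the code is neither the authors' implementation nor the PCF algorithm.

```python
import random


def push_flow_average(values, neighbors, rounds=300, loss_prob=0.0, seed=0):
    """Simulate flow-based gossip averaging; return each node's estimate.

    Assumes neighbors is symmetric: j in neighbors[i] implies i in neighbors[j].
    """
    rng = random.Random(seed)
    n = len(values)
    # flow[i][j] = [value, weight] that node i believes it has sent to j in total.
    flow = [{j: [0.0, 0.0] for j in neighbors[i]} for i in range(n)]

    def local_state(i):
        # Current (value, weight) = initial data minus everything that flowed out.
        s = values[i] - sum(f[0] for f in flow[i].values())
        w = 1.0 - sum(f[1] for f in flow[i].values())
        return s, w

    for _ in range(rounds):
        for i in range(n):
            j = rng.choice(neighbors[i])
            s, w = local_state(i)
            # Push half of the local state onto the edge (i, j).
            flow[i][j][0] += s / 2
            flow[i][j][1] += w / 2
            if rng.random() < loss_prob:
                continue  # message lost: j's view of this edge stays stale for now
            # The message carries the whole flow, so j simply mirrors it.
            flow[j][i] = [-flow[i][j][0], -flow[i][j][1]]

    estimates = []
    for i in range(n):
        s, w = local_state(i)
        estimates.append(s / w)
    return estimates


# Eight nodes on a complete graph; the true average of 1..8 is 4.5.
values = [float(v) for v in range(1, 9)]
neighbors = [[j for j in range(8) if j != i] for i in range(8)]
reliable = push_flow_average(values, neighbors)
lossy = push_flow_average(values, neighbors, loss_prob=0.2)
# Largest deviation from the true average, without and with message loss.
print(max(abs(e - 4.5) for e in reliable), max(abs(e - 4.5) for e in lossy))
```

Because each delivered message overwrites the receiver's view of the edge with the sender's full flow, a lost message in this toy model is repaired by the next successful exchange on the same edge. The sketch does not model node failures, bit flips, or the recovery and accuracy issues the abstract attributes to existing methods.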

Citation

Pages: 643 - 651

Number of pages: 9

