An architecture for rapid distributed fault tolerance

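Author
Russ, SH [1]
Affiliation
[1] Mississippi State University, NSF Engineering Research Center for Computational Field Simulation, Mississippi State, MS 39762, USA
Source
Keywords
DOI
Not available
Chinese Library Classification
TP301 [Theory, methods]
Discipline classification code
081202
Abstract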
Embedded high-performance computing is increasingly called upon to provide critical computing resources. The ability to tolerate faults during operation, both maintaining operational capability and ensuring that correct results continue to be produced, is an important ingredient in mission-critical systems. An architecture for such a system is proposed, providing the ability to withstand faults with graceful degradation in performance and complete transparency to the applications programmer. The final system will offer fault-tolerant computing transparently to MPI applications and will draw heavily on existing, demonstrated successes.
Pages: 925-930
Page count: 6