Supporting cost-effective fault tolerance in distributed message-passing applications with file operations

Cited by: 0
Authors
Performance Technology Center, Hewlett-Packard Company, Roseville, CA 95747, United States [1]
Unknown [2]
Affiliations
Source
J Supercomput | / Vol. 3 / No. 207-232
Keywords
Algorithms - Computer system recovery - Computer systems programming - Data communication systems - Fault tolerant computer systems - File organization - Response time (computer systems) - Software engineering - Subroutines
DOI
Not available
Chinese Library Classification (CLC) number
Subject classification code
Abstract
In this paper, we present an approach to reliable distributed computing that incorporates fault tolerance into applications at low cost, in terms of both run-time performance and the programming effort required to construct reliable application software. In our model, fault tolerance is based on distributed consistent checkpointing and rollback-recovery integrated with a user-level reliable transmission protocol. Through novel techniques and algorithms, our approach is distinguished from other consistent checkpointing schemes by the following features: first, minimum communication overhead for constructing a consistent distributed checkpoint and catching messages in transit during checkpointing; second, tolerance to message losses due to site failures or unreliable non-FIFO networks; and third, efficient checkpointing and recovery of persistent state, i.e., user files. Based on the model, a software library prototype called Libra has been implemented to support fault tolerance in distributed message-passing applications with file operations. The library provides an easy-to-use programming interface, including message-passing and file I/O primitives, which hides from the application level the complexity of both fault-tolerant network communication and the checkpointing and recovery of user files. Experience with a number of long-running distributed applications shows that Libra can provide fault tolerance in a cost-effective manner.