Proactive fault tolerance in MPI applications via task migration

Cited by: 0
Authors
Chakravorty, Sayantan [1]
Mendes, Celso L. [1]
Kale, Laxmikant V. [1]
Affiliation
[1] Univ Illinois, Dept Comp Sci, 1304 W Springfield Ave, Urbana, IL 61801 USA
Keywords
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods];
Subject classification code
081202;
Abstract
Failures are likely to be more frequent in systems with thousands of processors. Therefore, schemes for dealing with faults become increasingly important. In this paper, we present a fault tolerance solution for parallel applications that proactively migrates execution from processors where failure is imminent. Our approach assumes that some failures are predictable, and leverages the features in current hardware devices supporting early indication of faults. We use the concepts of processor virtualization and dynamic task migration, provided by Charm++ and Adaptive MPI (AMPI), to implement a mechanism that migrates tasks away from processors which are expected to fail. To demonstrate the feasibility of our approach, we present performance data from experiments with existing MPI applications. Our results show that proactive task migration is an effective technique to tolerate faults in MPI applications.
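The migration idea in the abstract can be illustrated with a small, self-contained sketch. It is not the Charm++/AMPI implementation and uses none of its API: the function name `evacuate`, the dictionary-based task placement, and the greedy least-loaded heuristic are all assumptions made for illustration. It only shows the general shape of the technique: when a failure warning arrives for one processor, its tasks are reassigned to the remaining healthy processors.

```python
import heapq


def evacuate(placement, task_load, processors, failing):
    """Return a new placement with every task moved off `failing`.

    Hypothetical model (not the paper's code):
      placement  -- dict mapping task name -> processor id
      task_load  -- dict mapping task name -> estimated load
      processors -- list of all processor ids
      failing    -- processor id for which a failure warning was received
    """
    healthy = [p for p in processors if p != failing]
    if not healthy:
        raise ValueError("no healthy processor left to receive tasks")

    # Load currently carried by each healthy processor.
    load = {p: 0.0 for p in healthy}
    for task, proc in placement.items():
        if proc != failing:
            load[proc] += task_load[task]

    # Min-heap keyed on load, so the least-loaded processor is popped first.
    heap = [(l, p) for p, l in load.items()]
    heapq.heapify(heap)

    new_placement = dict(placement)
    # Greedy heuristic: place the heaviest evacuated tasks first.
    evacuated = sorted(
        (t for t, p in placement.items() if p == failing),
        key=lambda t: (-task_load[t], t),
    )
    for task in evacuated:
        l, p = heapq.heappop(heap)
        new_placement[task] = p
        heapq.heappush(heap, (l + task_load[task], p))
    return new_placement


if __name__ == "__main__":
    placement = {"a": 0, "b": 0, "c": 1, "d": 2}
    task_load = {"a": 3.0, "b": 1.0, "c": 2.0, "d": 1.0}
    print(evacuate(placement, task_load, [0, 1, 2], failing=0))
```

The sketch relies on each physical processor hosting several tasks, which is the over-decomposition that processor virtualization provides: evacuated work can then be spread across the survivors rather than moved wholesale onto a single processor.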
Pages: 485 / +
Page count: 3
Related papers
50 in total
  • [1] Proactive Fault Tolerance Using Preemptive Migration
    Engelmann, C.
    Vallee, G. R.
    Naughton, T.
    Scott, S. L.
    [J]. PROCEEDINGS OF THE PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING, 2009, : 252 - 257
  • [2] OCFTL: An MPI Implementation-Independent Fault Tolerance Library for Task-Based Applications
    Di Francia Rosso, Pedro Henrique
    Francesquini, Emilio
    [J]. HIGH PERFORMANCE COMPUTING, CARLA 2021, 2022, 1540 : 131 - 147
  • [3] Replication-Based Fault Tolerance for MPI Applications
    Walters, John Paul
    Chaudhary, Vipin
    [J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2009, 20 (07) : 997 - 1010
  • [4] A FAULT TOLERANCE SOLUTION FOR SEQUENTIAL AND MPI APPLICATIONS ON THE GRID
    Rodriguez, Gabriel
    Pardo, Xoan C.
    Martin, Maria J.
    Gonzalez, Patricia
    Diaz, Daniel
    [J]. SCALABLE COMPUTING-PRACTICE AND EXPERIENCE, 2008, 9 (02): 101 - 109
  • [5] A fault tolerance solution for sequential and MPI applications on the grid
    Computer Architecture Group, University of A Coruña, Spain
    [J]. Scalable Comput. Pract. Exp., 2008, 2 (101-109): : 101 - 109
  • [6] A Channel Memory based fault tolerance for MPI applications
    Selikhov, A
    Germain, C
    [J]. FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2005, 21 (05): 709 - 715
  • [7] Providing Proactive Fault Tolerance as a Service for Cloud Applications
    Liu, Jing
    Zhao, Junfeng
    [J]. Proceedings 2016 IEEE World Congress on Services - SERVICES 2016, 2016, : 126 - 127
  • [8] Fault tolerance of MPI applications in exascale systems: The ULFM solution
    Losada, Nuria
    Gonzalez, Patricia
    Martin, Maria J.
    Bosilca, George
    Bouteiller, Aurelien
    Teranishi, Keita
    [J]. FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2020, 106 (106): 467 - 481
  • [9] Fault tolerance for cluster-oriented MPI parallel applications
    Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
    [J]. Qinghua Daxue Xuebao, 2006, 1 (67-69+110):
  • [10] Deploying fault-tolerance and task migration with NetSolve
    Plank, JS
    Casanova, H
    Beck, M
    Dongarra, J
    [J]. APPLIED PARALLEL COMPUTING: LARGE SCALE SCIENTIFIC AND INDUSTRIAL PROBLEMS, 1998, 1541 : 418 - 432