FAULT TOLERANCE SCHEMES FOR GLOBAL LOAD BALANCING IN X10

Authors
Fohry, Claudia [1]
Bungart, Marco [1]
Posner, Jonas [1]
Affiliation
[1] Univ Kassel, Res Grp Programming Languages Methodol, D-34125 Kassel, Germany
Keywords
Resilient X10; task pool; GLB; algorithmic resilience; lifeline scheme
DOI
Not available
Chinese Library Classification
TP31 [Computer software]
Subject classification codes
081202; 0835
Abstract
Scalability requires efficient fault tolerance. One approach handles permanent node failures at the user level. It is supported by Resilient X10, a Partitioned Global Address Space language that throws an exception when a place fails. We consider task pools, a widely used pattern for load balancing of irregular applications, and focus on the variant implemented in GLB, the Global Load Balancing framework of X10. In GLB, each worker maintains a private pool and supports cooperative work stealing. Victim selection and termination detection follow the lifeline scheme. Tasks may generate new tasks dynamically and are free of side effects; their results are combined by reduction. We assume a single worker per node and that failures are rare and uncorrelated. The paper introduces two fault tolerance schemes. Both are based on regular backups of the local task pool contents, which are written to the main memory of another worker and updated whenever tasks are stolen. The first scheme relies mainly on synchronous communication. The second scheme uses asynchronous communication and significantly improves on the first scheme's efficiency and robustness. Both schemes were implemented by extending the GLB source code. Experiments were run with the Unbalanced Tree Search (UTS) and Betweenness Centrality benchmarks. For UTS on 128 nodes, for instance, we observed an overhead of about 81% with the synchronous scheme and about 7% with the asynchronous scheme. The protocol overhead incurred by a place failure was negligible.
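The backup idea described in the abstract can be illustrated with a small single-process simulation. This is a sketch under stated assumptions, not the authors' GLB extension: the tree-counting workload, the ring-based choice of the worker that holds a backup (its "buddy"), the steal-half victim selection, and every name (`ResilientPool`, `write_backup`, `fail`, and so on) are hypothetical. It shows the invariant such a scheme must maintain: each backup holds a snapshot of a worker's pool together with its partial result, both affected backups are refreshed on a steal, and after a failure the buddy merges the lost worker's backup, so the final reduction counts every task exactly once. Communication is not modeled at all.

```python
from dataclasses import dataclass, field

# Hypothetical workload: a task is the depth of a tree node. Processing a task
# counts one node (result += 1) and spawns BRANCHING children below MAX_DEPTH.
BRANCHING = 3
MAX_DEPTH = 6


def expand(depth: int) -> list[int]:
    return [depth + 1] * BRANCHING if depth < MAX_DEPTH else []


@dataclass
class Worker:
    wid: int
    pool: list[int] = field(default_factory=list)  # private task pool
    result: int = 0                                # partial result for the final reduction
    alive: bool = True
    since_backup: int = 0                          # tasks processed since the last backup
    held: dict[int, tuple[list[int], int]] = field(default_factory=dict)  # backups kept for other workers


class ResilientPool:
    def __init__(self, n: int, chunk: int = 4, backup_interval: int = 10):
        assert n >= 2, "a backup must be held by another worker"
        self.workers = [Worker(i) for i in range(n)]
        self.chunk = chunk
        self.backup_interval = backup_interval
        self.workers[0].pool.append(0)  # root task
        for w in self.workers:
            self.write_backup(w)

    def buddy(self, wid: int) -> Worker:
        """Next live worker in ring order (assumption); it stores the backup of worker wid."""
        n = len(self.workers)
        for k in range(1, n):
            w = self.workers[(wid + k) % n]
            if w.alive:
                return w
        raise RuntimeError("no live worker left to hold a backup")

    def write_backup(self, w: Worker) -> None:
        # Copy pool contents and partial result into another worker's memory.
        self.buddy(w.wid).held[w.wid] = (list(w.pool), w.result)
        w.since_backup = 0

    def process(self, w: Worker) -> None:
        for _ in range(self.chunk):
            if not w.pool:
                break
            w.result += 1
            w.pool.extend(expand(w.pool.pop()))
            w.since_backup += 1
        if w.since_backup >= self.backup_interval:
            self.write_backup(w)  # regular backup; may lag behind the live state

    def steal(self, thief: Worker, victim: Worker) -> None:
        half = len(victim.pool) // 2
        if half == 0:
            return
        thief.pool.extend(victim.pool[:half])  # take the oldest tasks
        del victim.pool[:half]
        # Refresh both backups so every stolen task is backed up exactly once.
        self.write_backup(victim)
        self.write_backup(thief)

    def fail(self, f: Worker) -> None:
        f.alive = False
        f.pool.clear()
        f.held.clear()  # everything in the failed worker's memory is lost
        b = self.buddy(f.wid)
        tasks, result = b.held.pop(f.wid)
        b.pool.extend(tasks)  # tasks processed after the last backup will be redone
        b.result += result
        for w in self.workers:  # re-create backups that lived on the failed worker
            if w.alive:
                self.write_backup(w)

    def run(self, fail_round: int = -1, fail_wid: int = 1) -> int:
        rnd = 0
        while any(w.pool for w in self.workers if w.alive):
            live = [w for w in self.workers if w.alive]
            for w in live:
                if w.pool:
                    self.process(w)
                else:  # simplistic victim choice; GLB itself uses random and lifeline steals
                    self.steal(w, max(live, key=lambda v: len(v.pool)))
            if rnd == fail_round:
                self.fail(self.workers[fail_wid])
            rnd += 1
        return sum(w.result for w in self.workers if w.alive)  # final reduction


print(ResilientPool(4).run())                # prints 1093
print(ResilientPool(4).run(fail_round=3))    # prints 1093
```

Both runs return 1093, the node count of the hypothetical tree (3^0 + ... + 3^6). In the run with a failure, tasks the failed worker processed after its last backup are re-executed from the backup rather than lost or counted twice. Because the sketch ignores communication, it says nothing about the synchronous versus asynchronous trade-off measured in the paper.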
Pages: 169-185
Number of pages: 17