Fault-Tolerant Global Load Balancing in X10

Cited by: 1
Authors
Bungart, Marco [1]
Fohry, Claudia [1]
Posner, Jonas [1]
Affiliations
[1] Univ Kassel, Res Grp Programming Languages Methodol, Kassel, Germany
Keywords
Resilient X10; task pool; GLB; algorithmic resilience
DOI
10.1109/SYNASC.2014.69
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
To be effective, scalability requires fault tolerance. We consider a user-level fault tolerance technique to cope with permanent node failures. It is supported by X10, one of the major Partitioned Global Address Space (PGAS) languages. In Resilient X10, an exception is thrown when a place (node) fails. This paper investigates task pools, which are often used by irregular applications to balance their load. We consider global load balancing with one worker per place. Each worker maintains a private task pool and supports cooperative work stealing. Tasks may generate new tasks dynamically, are free of side effects, and their results are combined by reduction. Our first contribution is a task pool algorithm that can handle permanent place failures. It is based on snapshots that are regularly written to other workers and are updated in the event of stealing. Second, we implemented the algorithm in the Global Load Balancing framework GLB, which is part of the standard library of X10. We ran experiments with the Unbalanced Tree Search (UTS) and Betweenness Centrality (BC) benchmarks. With 64 places on 4 nodes, for instance, we observed an overhead of about 4% for using fault-tolerant GLB instead of GLB. The protocol overhead for a place failure was negligible.
Pages: 471-478
Number of pages: 8