Iaso: an autonomous fault-tolerant management system for supercomputers

Cited by: 6
Authors
Lu, Kai [1, 2]
Wang, Xiaoping [1, 2]
Li, Gen [2]
Wang, Ruibo [2]
Chi, Wanqing [2]
Liu, Yongpeng [2]
Tang, Hongwei [2]
Feng, Hua [2]
Gao, Yinghui [3]
Affiliations
[1] Natl Univ Def Technol, Sci & Technol Parallel & Distributed Proc Lab, Changsha 410073, Hunan, Peoples R China
[2] Natl Univ Def Technol, Coll Comp, Changsha 410073, Hunan, Peoples R China
[3] Natl Univ Def Technol, ATR Lab, Changsha 410073, Hunan, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
supercomputer; autonomous management; fault tolerant; fault management; MilkyWay-2 system
DOI
10.1007/s11704-014-3503-1
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
As system scale increases, the inherent reliability of supercomputers keeps decreasing. The cost of fault handling and task recovery increases so rapidly that the reliability issue will soon harm the usability of supercomputers. This issue is referred to as the "reliability wall", which is regarded as a critical problem for current and future supercomputers. To address this problem, we propose an autonomous fault-tolerant system, named Iaso, in the MilkyWay-2 system. Iaso introduces the concept of autonomous management to supercomputers: the computer itself, rather than human operators, takes charge of fault management. Iaso automatically manages the whole lifecycle of faults, including fault detection, fault diagnosis, fault isolation, and task recovery. Iaso endows the MilkyWay-2 system with autonomous features such as self-awareness, self-diagnosis, self-healing, and self-protection. With the help of Iaso, the cost of fault handling in supercomputers is reduced from several hours to a few seconds. Iaso greatly improves the usability and reliability of the MilkyWay-2 system.
Pages: 378-390
Number of pages: 13