Iaso: an autonomous fault-tolerant management system for supercomputers

Cited by: 6
Authors
Lu, Kai [1, 2]
Wang, Xiaoping [1, 2]
Li, Gen [2]
Wang, Ruibo [2]
Chi, Wanqing [2]
Liu, Yongpeng [2]
Tang, Hongwei [2]
Feng, Hua [2]
Gao, Yinghui [3]
Affiliations
[1] Natl Univ Def Technol, Sci & Technol Parallel & Distributed Proc Lab, Changsha 410073, Hunan, Peoples R China
[2] Natl Univ Def Technol, Coll Comp, Changsha 410073, Hunan, Peoples R China
[3] Natl Univ Def Technol, ATR Lab, Changsha 410073, Hunan, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
supercomputer; autonomous management; fault tolerant; fault management; MilkyWay-2; system;
DOI
10.1007/s11704-014-3503-1
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
As system scale increases, the inherent reliability of supercomputers declines. The cost of fault handling and task recovery rises so rapidly that the reliability issue will soon harm the usability of supercomputers. This issue, referred to as the "reliability wall", is regarded as a critical problem for current and future supercomputers. To address it, we propose an autonomous fault-tolerant system, named Iaso, for the MilkyWay-2 system. Iaso introduces the concept of autonomous management to supercomputers: the computer itself, rather than human operators, takes charge of fault management. Iaso automatically manages the whole lifecycle of faults, including fault detection, fault diagnosis, fault isolation, and task recovery. Iaso endows the MilkyWay-2 system with autonomous features such as self-awareness, self-diagnosis, self-healing, and self-protection. With the help of Iaso, the cost of fault handling in supercomputers drops from several hours to a few seconds. Iaso greatly improves the usability and reliability of the MilkyWay-2 system.
Pages: 378-390
Page count: 13