Iaso: an autonomous fault-tolerant management system for supercomputers

Cited by: 6
Authors
Lu, Kai [1, 2]
Wang, Xiaoping [1, 2]
Li, Gen [2]
Wang, Ruibo [2]
Chi, Wanqing [2]
Liu, Yongpeng [2]
Tang, Hongwei [2]
Feng, Hua [2]
Gao, Yinghui [3]
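Affiliations
[1] National University of Defense Technology, Science and Technology on Parallel and Distributed Processing Laboratory, Changsha 410073, Hunan, People's Republic of China
[2] National University of Defense Technology, College of Computer, Changsha 410073, Hunan, People's Republic of China
[3] National University of Defense Technology, ATR Laboratory, Changsha 410073, Hunan, People's Republic of China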
Funding
National Natural Science Foundation of China
Keywords
supercomputer; autonomous management; fault tolerant; fault management; MilkyWay-2; system
DOI
10.1007/s11704-014-3503-1
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
As system scale increases, the inherent reliability of supercomputers keeps falling. The cost of fault handling and task recovery grows so rapidly that the reliability issue will soon harm the usability of supercomputers. This issue, referred to as the "reliability wall", is regarded as a critical problem for current and future supercomputers. To address it, we propose an autonomous fault-tolerant system, named Iaso, in the MilkyWay-2 system. Iaso introduces the concept of autonomous management to supercomputers: the computer itself, rather than human operators, takes charge of fault management. Iaso automatically manages the whole lifecycle of faults, including fault detection, fault diagnosis, fault isolation, and task recovery. It endows the MilkyWay-2 system with autonomous features such as self-awareness, self-diagnosis, self-healing, and self-protection. With the help of Iaso, the cost of fault handling in supercomputers is reduced from several hours to a few seconds. Iaso greatly improves the usability and reliability of the MilkyWay-2 system.
Pages: 378-390
Page count: 13
Published in
Frontiers of Computer Science, 2014, 8: 378-390