Measuring the Resiliency of Extreme-Scale Computing Environments

Cited by: 10
Author affiliations
[1] Bell Labs-Nokia, 600 Mountain Ave, New Providence, NJ 07974, United States
[2] Unknown, IL 61801, United States
Keywords
File organization - Graphics processing unit - Supercomputers
DOI
10.1007/978-3-319-30599-8_24
Abstract
This chapter presents a case study on how to characterize the resiliency of large-scale computers. The analysis focuses on the failures and errors of Blue Waters, the Cray hybrid (CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The characterization is performed by a joint analysis of several data sources, which include workload and error/failure logs as well as manual failure reports. We describe LogDiver, a tool that automates the data preprocessing and the computation of metrics that measure the impact of system errors and failures on user applications, i.e., the compiled programs launched by user jobs that can execute across one or more XE (CPU) or XK (CPU+GPU) nodes. Results include (i) a characterization of the root causes of single-node failures; (ii) a direct assessment of the effectiveness of system-level failover and of memory, processor, network, GPU accelerator, and file system error resiliency; (iii) an analysis of system-wide outages; (iv) an analysis of application resiliency to system-related errors; and (v) insight into the relationship between application scale and resiliency across different error categories. © Springer International Publishing Switzerland 2016.
Related papers
50 in total
  • [1] ARCHITECTURES FOR EXTREME-SCALE COMPUTING
    Torrellas, Josep
    COMPUTER, 2009, 42 (11) : 28 - 35
  • [2] Algorithm development for extreme-scale computing
    Jiachang Sun
    Chao Yang
    Xiao-Chuan Cai
    National Science Review, 2016, 3 (01) : 26 - 27
  • [3] Algorithm development for extreme-scale computing
    Sun, Jiachang
    Yang, Chao
    Cai, Xiao-Chuan
    NATIONAL SCIENCE REVIEW, 2016, 3 (01) : 26 - 27
  • [4] Extreme-scale parallel computing: bottlenecks and strategies
    Ze-yao Mo
    Frontiers of Information Technology & Electronic Engineering, 2018, 19 : 1251 - 1260
  • [5] Extreme-scale parallel computing: bottlenecks and strategies
    Mo, Ze-yao
    FRONTIERS OF INFORMATION TECHNOLOGY & ELECTRONIC ENGINEERING, 2018, 19 (10) : 1251 - 1260
  • [6] Accelerating incremental checkpointing for extreme-scale computing
    Ferreira, Kurt B.
    Riesen, Rolf
    Bridges, Patrick
    Arnold, Dorian
    Brightwell, Ron
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2014, 30 : 66 - 77
  • [7] Epidemic Fault Tolerance for Extreme-Scale Parallel Computing
    Katti, Amogh
    Di Fatta, Giuseppe
    INTERNET AND DISTRIBUTED COMPUTING SYSTEMS, IDCS 2015, 2015, 9258 : 201 - 208
  • [8] ENERGY-EFFICIENT COMPUTING FOR EXTREME-SCALE SCIENCE
    Donofrio, David
    Oliker, Leonid
    Shalf, John
    Wehner, Michael F.
    Rowen, Chris
    Krueger, Jens
    Kamil, Shoaib
    Mohiyuddin, Marghoob
    COMPUTER, 2009, 42 (11) : 62 - 71
  • [9] Application health monitoring for extreme-scale resiliency using cooperative fault management
    Agarwal, Pratul K.
    Naughton, Thomas
    Park, Byung H.
    Bernholdt, David E.
    Hursey, Joshua J.
    Geist, Al
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2020, 32 (02):
  • [10] Hierarchical Krylov and nested Krylov methods for extreme-scale computing
    McInnes, Lois Curfman
    Smith, Barry
    Zhang, Hong
    Mills, Richard Tran
    PARALLEL COMPUTING, 2014, 40 (01) : 17 - 31