Measuring the Resiliency of Extreme-Scale Computing Environments

Cited by: 10

Authors:

Bell Labs-Nokia, 600 Mountain Ave, New Providence ^{[1]}

07974, United States

Unknown ^{[2]}

61801, United States

Source:

Springer Ser. Reliab. Eng., 609-655

Keywords:

File organization; Graphics processing unit; Supercomputers

DOI:

10.1007/978-3-319-30599-8_24

Abstract:

This chapter presents a case study on how to characterize the resiliency of large-scale computers. The analysis focuses on the failures and errors of Blue Waters, the Cray hybrid (CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The characterization is performed by a joint analysis of several data sources, which include workload and error/failure logs as well as manual failure reports. We describe LogDiver, a tool that automates the data preprocessing and the computation of metrics that measure the impact of system errors and failures on user applications, i.e., the compiled programs launched by user jobs that can execute across one or more XE (CPU) or XK (CPU+GPU) nodes. Results include (i) a characterization of the root causes of single-node failures; (ii) a direct assessment of the effectiveness of system-level failover and of memory, processor, network, GPU accelerator, and file system error resiliency; (iii) an analysis of system-wide outages; (iv) an analysis of application resiliency to system-related errors; and (v) insight into the relationship between application scale and resiliency across different error categories. © Springer International Publishing Switzerland 2016.
