Measuring the Resiliency of Extreme-Scale Computing Environments

Cited by: 10
Author affiliations
[1] Bell Labs-Nokia, 600 Mountain Ave, New Providence, NJ 07974, United States
[2] Affiliation not specified, IL 61801, United States
Institution
Source
Keywords
File organization; Graphics processing unit; Supercomputers
DOI
10.1007/978-3-319-30599-8_24
Chinese Library Classification number
Subject classification number
Abstract
This chapter presents a case study on how to characterize the resiliency of large-scale computers. The analysis focuses on the failures and errors of Blue Waters, the Cray hybrid (CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The characterization is performed by a joint analysis of several data sources, which include workload and error/failure logs as well as manual failure reports. We describe LogDiver, a tool that automates the data preprocessing and metric computation needed to measure the impact of system errors and failures on user applications, i.e., the compiled programs launched by user jobs that can execute across one or more XE (CPU) or XK (CPU+GPU) nodes. Results include (i) a characterization of the root causes of single-node failures; (ii) a direct assessment of the effectiveness of system-level failover and of memory, processor, network, GPU accelerator, and file system error resiliency; (iii) an analysis of system-wide outages; (iv) an analysis of application resiliency to system-related errors; and (v) insight into the relationship between application scale and resiliency across different error categories. © Springer International Publishing Switzerland 2016.