Measuring the Resiliency of Extreme-Scale Computing Environments

Cited by: 10
Authors
[1] Bell Labs-Nokia, 600 Mountain Ave, New Providence, NJ 07974, United States
[2] Unknown, IL 61801, United States
Institutions
Source
Keywords
File organization - Graphics processing unit - Supercomputers
DOI
10.1007/978-3-319-30599-8_24
Chinese Library Classification number
Subject classification number
Abstract
This chapter presents a case study on how to characterize the resiliency of large-scale computers. The analysis focuses on the failures and errors of Blue Waters, the Cray hybrid (CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The characterization is performed by a joint analysis of several data sources, which include workload and error/failure logs as well as manual failure reports. We describe LogDiver, a tool to automate the data preprocessing and metric computation that measure the impact of system errors and failures on user applications, i.e., the compiled programs launched by user jobs that can execute across one or more XE (CPU) or XK (CPU+GPU) nodes. Results include (i) a characterization of the root causes of single node failures; (ii) a direct assessment of the effectiveness of system-level failover and of memory, processor, network, GPU accelerator, and file system error resiliency; (iii) an analysis of system-wide outages; (iv) an analysis of application resiliency to system-related errors; and (v) insight into the relationship between application scale and resiliency across different error categories. © Springer International Publishing Switzerland 2016.
Related papers
50 in total
  • [41] The Top 10 Challenges in Extreme-Scale Visual Analytics
    Wong, Pak Chung
    Shen, Han-Wei
    Johnson, Christopher R.
    Chen, Chaomei
    Ross, Robert B.
    IEEE COMPUTER GRAPHICS AND APPLICATIONS, 2012, 32 (04) : 63 - 67
  • [42] Unified model for assessing checkpointing protocols at extreme-scale
    Bosilca, George
    Bouteiller, Aurelien
    Brunet, Elisabeth
    Cappello, Franck
    Dongarra, Jack
    Guermouche, Amina
    Herault, Thomas
    Robert, Yves
    Vivien, Frederic
    Zaidouni, Dounia
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2014, 26 (17) : 2772 - 2791
  • [43] HipMer: An Extreme-Scale De Novo Genome Assembler
    Georganas, Evangelos
    Buluc, Aydin
    Chapman, Jarrod
    Hofmeyr, Steven
    Aluru, Chaitanya
    Egan, Rob
    Oliker, Leonid
    Rokhsar, Daniel
    Yelick, Katherine
    PROCEEDINGS OF SC15: THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS, 2015
  • [44] Processing Extreme-Scale Graphs on China's Supercomputers
    Zhang, Yiming
    Lu, Kai
    Chen, Wenguang
    COMMUNICATIONS OF THE ACM, 2021, 64 (11) : 60 - 63
  • [45] Profiling the Usage of an Extreme-Scale Archival Storage System
    Sim, Hyogi
    Vazhkudai, Sudharshan S.
    2019 IEEE 27TH INTERNATIONAL SYMPOSIUM ON MODELING, ANALYSIS, AND SIMULATION OF COMPUTER AND TELECOMMUNICATION SYSTEMS (MASCOTS 2019), 2019 : 410 - 422
  • [46] Extreme-scale quantum and reactive molecular dynamics simulations
    Nakano, Aiichiro
    ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY, 2017, 254
  • [47] Extreme-scale motions in turbulent plane Couette flows
    Lee, Myoungkyu
    Moser, Robert D.
    JOURNAL OF FLUID MECHANICS, 2018, 842 : 128 - 145
  • [48] A characterization of workflow management systems for extreme-scale applications
    da Silva, Rafael Ferreira
    Filgueira, Rosa
    Pietri, Ilia
    Jiang, Ming
    Sakellariou, Rizos
    Deelman, Ewa
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2017, 75 : 228 - 238
  • [49] Programmer-guided reliability for extreme-scale applications
    Bernholdt, David E.
    Elwasif, Wael R.
    Kartsaklis, Christos
    Lee, Seyong
    Mintz, Tiffany M.
    INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2018, 32 (05) : 598 - 612
  • [50] Programmer-Guided Reliability for Extreme-Scale Applications
    Bernholdt, David E.
    Elwasif, Wael R.
    Kartsaklis, Christos
    Lee, Seyong
    Mintz, Tiffany M.
    2015 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING - CLUSTER 2015, 2015 : 571 - 579