Probabilistic Diagnosis of Performance Faults in Large-Scale Parallel Applications

Cited by: 0
Authors
Laguna, Ignacio [1 ]
Ahn, Dong H. [2 ]
de Supinski, Bronis R. [2 ]
Bagchi, Saurabh [1 ]
Gamblin, Todd [2 ]
Affiliations
[1] Purdue Univ, Sch Elect & Comp Engn, W Lafayette, IN 47907 USA
[2] Lawrence Livermore Natl Lab, Computat Directorate, Livermore, CA 94550 USA
Funding
National Science Foundation (USA);
Keywords
Reliability; Performance;
DOI
Not available
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812 ;
Abstract
Debugging large-scale parallel applications is challenging. Most existing techniques provide mechanisms for process control but little information about the causes of failures. Most debuggers also scale poorly despite continued growth in supercomputer core counts. Our novel, highly scalable tool helps developers to understand and to fix performance failures and correctness problems at scale. Our tool probabilistically infers the least progressed task in MPI programs using Markov models of execution history and dependence analysis. This analysis guides program slicing to find code that may have caused a failure. In a blind study, we demonstrate that our tool can isolate the root cause of a particularly perplexing bug encountered at scale in a molecular dynamics simulation. Further, we perform fault injections into two benchmark codes and measure the scalability of the tool. Our results show that it accurately detects the least progressed task in most cases and can perform the diagnosis in a fraction of a second with thousands of tasks.
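The idea of inferring the least progressed task from a Markov model of execution history can be illustrated with a small sketch. This is not the authors' algorithm: the state names, the helper functions, and the scoring rule (reachability from one task's state to another's, minus the reverse) are assumptions chosen only for illustration.

```python
from collections import defaultdict


def build_markov_model(traces):
    """Estimate state-transition probabilities from per-task state sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for trace in traces:
        for src, dst in zip(trace, trace[1:]):
            counts[src][dst] += 1
    model = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        model[src] = {dst: n / total for dst, n in dsts.items()}
    return model


def reach_probability(model, src, dst, max_steps=16):
    """Probability of reaching dst from src within max_steps transitions."""
    if src == dst:
        return 1.0
    dist = {src: 1.0}  # probability mass that has not yet reached dst
    hit = 0.0
    for _ in range(max_steps):
        nxt = defaultdict(float)
        for state, p in dist.items():
            for succ, q in model.get(state, {}).items():
                if succ == dst:
                    hit += p * q  # dst is treated as absorbing
                else:
                    nxt[succ] += p * q
        dist = nxt
    return hit


def least_progressed(model, current_states):
    """Return the task whose state most likely precedes the other tasks' states.

    current_states maps a task id to the state it was last observed in.
    The score is a hypothetical heuristic, not the paper's dependence analysis.
    """
    def behind_score(task):
        s = current_states[task]
        return sum(
            reach_probability(model, s, t) - reach_probability(model, t, s)
            for other, t in current_states.items()
            if other != task
        )
    return max(current_states, key=behind_score)


if __name__ == "__main__":
    # Hypothetical execution history: every task follows the same phases.
    history = [["init", "compute", "send", "wait", "finalize"]] * 3
    model = build_markov_model(history)
    # Tasks 0 and 2 are blocked in "wait"; task 1 is still in "compute".
    print(least_progressed(model, {0: "wait", 1: "compute", 2: "wait"}))
```

In this toy setting the task still in "compute" is flagged, because its state can reach "wait" in the model while the reverse is not possible.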
Pages: 213-222
Page count: 10