Probabilistic Diagnosis of Performance Faults in Large-Scale Parallel Applications

Cited by: 0
Authors
Laguna, Ignacio [1]
Ahn, Dong H. [2]
de Supinski, Bronis R. [2]
Bagchi, Saurabh [1]
Gamblin, Todd [2]
Affiliations
[1] Purdue Univ, Sch Elect & Comp Engn, W Lafayette, IN 47907 USA
[2] Lawrence Livermore Natl Lab, Computat Directorate, Livermore, CA 94550 USA
Funding
National Science Foundation (USA)
Keywords
Reliability; Performance;
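DOI
Not available
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract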
Debugging large-scale parallel applications is challenging. Most existing techniques provide mechanisms for process control but little information about the causes of failures. Most debuggers also scale poorly despite continued growth in supercomputer core counts. Our novel, highly scalable tool helps developers to understand and to fix performance failures and correctness problems at scale. Our tool probabilistically infers the least progressed task in MPI programs using Markov models of execution history and dependence analysis. This analysis guides program slicing to find code that may have caused a failure. In a blind study, we demonstrate that our tool can isolate the root cause of a particularly perplexing bug encountered at scale in a molecular dynamics simulation. Further, we perform fault injections into two benchmark codes and measure the scalability of the tool. Our results show that it accurately detects the least progressed task in most cases and can perform the diagnosis in a fraction of a second with thousands of tasks.
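The abstract describes the core idea only at a high level, so the following is a minimal illustrative sketch, not the authors' implementation. It assumes each MPI task has a recorded sequence of execution states (the state names, the `traces` dictionary, and all function names here are hypothetical). It estimates Markov transition probabilities from those sequences, then uses only reachability over nonzero-probability transitions to find tasks whose current state is not preceded by any other task's current state. The actual tool combines probabilistic inference with dependence analysis, which this sketch does not attempt.

```python
from collections import defaultdict


def build_markov_model(traces):
    """Estimate transition probabilities from per-task state sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for trace in traces.values():
        for src, dst in zip(trace, trace[1:]):
            counts[src][dst] += 1
    model = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        model[src] = {dst: n / total for dst, n in dsts.items()}
    return model


def reachable(model, start):
    """Return states reachable from `start` via nonzero-probability edges."""
    seen, stack = set(), [start]
    while stack:
        state = stack.pop()
        for nxt in model.get(state, {}):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def least_progressed(traces):
    """Return tasks whose current state no other current state strictly precedes."""
    model = build_markov_model(traces)
    current = {task: trace[-1] for task, trace in traces.items()}
    reach = {s: reachable(model, s) for s in set(current.values())}

    def strictly_before(a, b):
        # a precedes b if b can follow a, but a cannot follow b (no loop).
        return b in reach[a] and a not in reach[b]

    return sorted(
        task
        for task, state in current.items()
        if not any(strictly_before(other, state) for other in current.values())
    )


# Hypothetical traces: task 2 stopped in "compute" while the others moved on.
traces = {
    0: ["init", "compute", "send", "barrier"],
    1: ["init", "compute", "send", "barrier"],
    2: ["init", "compute"],
    3: ["init", "compute", "send"],
}
print(least_progressed(traces))
```

States inside a loop are mutually reachable, so this sketch cannot order them and reports all such tasks as candidates; resolving those cases is where transition probabilities and dependence information would matter.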
Pages: 213 - 222
Number of pages: 10
Related papers
50 in total
  • [41] Real-Time Probabilistic Data Fusion for Large-Scale IoT Applications
    Akbar, Adnan
    Kousiouris, George
    Pervaiz, Haris
    Sancho, Juan
    Ta-Shma, Paula
    Carrez, Francois
    Moessner, Klaus
    IEEE ACCESS, 2018, 6 : 10015 - 10027
  • [42] Performance Analysis for Large-Scale Parallel Microscopic Traffic Simulation System
    Yin Fei
    Zhang Dongliang
    INTERNATIONAL JOURNAL OF DISTRIBUTED SENSOR NETWORKS, 2009, 5 (01) : 92 - 92
  • [43] Performance analysis of a parallel algorithm for restoring large-scale CT images
    Harizanov, Stanislav
    Lirkov, Ivan
    Georgiev, Krassimir
    Paprzycki, Marcin
    Ganzha, Maria
    JOURNAL OF COMPUTATIONAL AND APPLIED MATHEMATICS, 2017, 310 : 104 - 114
  • [44] Visual Diagnostics of Parallel Performance in Training Large-Scale DNN Models
    Wei, Yating
    Wang, Zhiyong
    Wang, Zhongwei
    Dai, Yong
    Ou, Gongchang
    Gao, Han
    Yang, Haitao
    Wang, Yue
    Cao, Caleb Chen
    Weng, Luoxuan
    Lu, Jiaying
    Zhu, Rongchen
    Chen, Wei
    IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, 2024, 30 (07) : 3915 - 3929
  • [45] Probabilistic Gradient Boosting Machines for Large-Scale Probabilistic Regression
    Sprangers, Olivier
    Schelter, Sebastian
    de Rijke, Maarten
    KDD '21: PROCEEDINGS OF THE 27TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2021, : 1510 - 1520
  • [46] Design of large-scale parallel simulations
    Knepley, MG
    Sameh, AH
    Sarin, V
    PARALLEL COMPUTATIONAL FLUID DYNAMICS: TOWARDS TERAFLOPS, OPTIMIZATION, AND NOVEL FORMULATIONS, 2000, : 273 - 279
  • [47] Parallel genesis for large-scale modeling
    Goddard, NH
    Hood, G
    COMPUTATIONAL NEUROSCIENCE: TRENDS IN RESEARCH, 1997, 1997, : 911 - 917
  • [48] Large-Scale Parallel Computing on Grids
    Bal, Henri
    Verstoep, Kees
    ELECTRONIC NOTES IN THEORETICAL COMPUTER SCIENCE, 2008, 220 (02) : 3 - 17
  • [49] Large-scale parallel data clustering
    Judd, D
    McKinley, PK
    Jain, AK
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 1998, 20 (08) : 871 - 876
  • [50] A Large-scale Parallel Fuzzing System
    Li, Yang
    Feng, Chao
    Tang, Chaojing
    ICAIP 2018: 2018 THE 2ND INTERNATIONAL CONFERENCE ON ADVANCES IN IMAGE PROCESSING, 2018, : 194 - 197