Accurate Application Progress Analysis for Large-Scale Parallel Debugging

Authors
Mitra, Subrata [1]
Laguna, Ignacio [2]
Ahn, Dong H. [2]
Bagchi, Saurabh [1]
Schulz, Martin [2]
Gamblin, Todd [2]
Affiliations
[1] Purdue Univ, W Lafayette, IN 47907 USA
[2] Lawrence Livermore Natl Lab, Livermore, CA USA
Keywords
Parallel debugging; high-performance computing; dynamic analysis; MPI; Performance; Algorithms; Reliability; Measurement
DOI
10.1145/2666356.2594336
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Debugging large-scale parallel applications is challenging. In most HPC applications, parallel tasks progress in a coordinated fashion, so a fault in one task can quickly propagate to other tasks, making it difficult to debug. Finding the least-progressed tasks can significantly reduce the effort needed to identify the task where the fault originated. However, existing approaches for detecting them suffer from low accuracy and large overheads: they either use imprecise static analysis or are unable to infer progress dependence inside loops. We present a loop-aware progress-dependence analysis tool, PRODOMETER, which determines relative progress among parallel tasks via dynamic analysis. Our fault-injection experiments suggest that its accuracy and precision are over 90% for most cases and that it scales well up to 16,384 MPI tasks. Further, our case study shows that it significantly helped in diagnosing a perplexing error in MPI that manifested only at large scale.
Pages: 193-203
Page count: 11