ELS: Emulation system for debugging and tuning large-scale parallel programs on small clusters

Authors
Fang Lin
Yi Liu
Yayu Guo
Depei Qian
Affiliation
[1] Beihang University, School of Computer Science and Engineering
Keywords
High-performance computing; Emulation system; Performance tuning; Debugging; Large-scale parallel programs
DOI
Not available
Abstract
The continued scaling-up of high-performance computing (HPC) systems makes large-scale parallel programs harder to debug and tune. First, to locate a bug or tune performance, programmers often need to execute the program repeatedly at a specified scale, which consumes massive resources. Second, because job scheduling systems are used so widely, programmers can only submit their programs as jobs and cannot interact with them, which restricts debugging efficiency and flexibility. To address these challenges, this paper proposes an emulation system that supports debugging and tuning of large-scale parallel programs by executing them at the desired scale on a small cluster. The program is first executed at the desired scale on the target HPC system to record the necessary information; programmers can then choose a subset of the program's processes and re-execute it repeatedly on a small cluster. During re-execution, the emulation system controls the execution of the processes, and programmers can debug the program by attaching tools to the selected processes. The system also supports the popular CPU+GPU heterogeneous architecture. The system is evaluated on a small cluster, with a 1000-node system as the target HPC system; experimental results demonstrate the accuracy and efficiency of emulation-execution.
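The record-then-re-execute idea in the abstract can be illustrated with a toy sketch. This is not the ELS implementation: the abstract does not say what information is recorded or how processes are controlled, so the ring-exchange workload, the message log keyed by (receiver, step), and the names `step`, `full_run`, and `replay` are all assumptions made for illustration. A full run records every message each rank receives; a later run re-executes only a chosen subset of ranks and serves messages from absent ranks out of the log.

```python
def step(rank, value, incoming):
    """Toy per-process computation (hypothetical workload)."""
    return (value * 31 + incoming + rank) % 1_000_003


def full_run(n_procs, n_steps):
    """Phase 1, standing in for the run on the target HPC system:
    execute every rank and record each message a rank receives."""
    state = list(range(n_procs))
    log = {}
    for t in range(n_steps):
        sent = list(state)  # snapshot of values sent at this step
        for r in range(n_procs):
            # Ring exchange: rank r receives from its left neighbour.
            incoming = sent[(r - 1) % n_procs]
            log[(r, t)] = incoming
            state[r] = step(r, state[r], incoming)
    return state, log


def replay(selected, n_procs, n_steps, log):
    """Phase 2, standing in for the small cluster: re-execute only the
    selected ranks; messages from absent ranks come from the log."""
    state = {r: r for r in selected}
    for t in range(n_steps):
        sent = dict(state)
        for r in selected:
            src = (r - 1) % n_procs
            incoming = sent[src] if src in sent else log[(r, t)]
            state[r] = step(r, state[r], incoming)
    return state


final, log = full_run(n_procs=1000, n_steps=5)
subset = [3, 4, 500]
replayed = replay(subset, 1000, 5, log)
# The selected ranks reach the same state as in the full-scale run.
assert all(replayed[r] == final[r] for r in subset)
```

In this sketch, rank 4 receives live data from rank 3 because both are selected, while ranks 3 and 500 receive logged data. A real system would also have to handle timing, nondeterminism, and GPU activity, none of which is modelled here.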
Pages: 1635 - 1666
Page count: 31
Source
Lin, Fang; Liu, Yi; Guo, Yayu; Qian, Depei. ELS: Emulation system for debugging and tuning large-scale parallel programs on small clusters [J]. Journal of Supercomputing, 2021, 77 (02): 1635 - 1666