ELS: Emulation system for debugging and tuning large-scale parallel programs on small clusters

Authors
Fang Lin
Yi Liu
Yayu Guo
Depei Qian
Affiliation
[1] Beihang University, School of Computer Science and Engineering
Keywords
High-performance computing; Emulation system; Performance tuning; Debugging; Large-scale parallel programs
DOI
Not available
Abstract
The continuous scaling-up of high-performance computing (HPC) systems has brought challenges to the debugging and tuning of large-scale parallel programs. First, to locate bugs in a program or tune its performance, programmers often need to execute the program repeatedly at a specified scale, which consumes massive resources. Second, because job scheduling systems are used extensively, programmers can only submit their programs as jobs and cannot interact with them, which restricts debugging efficiency and flexibility. To address these challenges, this paper proposes an emulation system that supports debugging and tuning of large-scale parallel programs by executing them at the desired scale on a small cluster. The program is first executed at the desired scale on the target HPC system to record the necessary information; programmers can then choose a subset of the program's processes and re-execute it repeatedly on a small cluster, during which the emulation system controls the execution of the processes and programmers can debug their programs by attaching tools to the selected processes. Moreover, the system supports the popular CPU+GPU heterogeneous architecture. The system is evaluated on a small cluster, with a 1000-node system used as the target HPC system; experimental results demonstrate the accuracy and efficiency of emulation-execution.
Pages: 1635 - 1666
Page count: 31