Optimizing the Resource and Job Management System of an Academic HPC & Research Computing Facility

Cited by: 0
Authors
Varrette, Sebastien [1]
Kieffer, Emmanuel [1]
Pinel, Frederic [1]
Affiliations
[1] Univ Luxembourg Luxembourg, Fac Sci Technol & Med FSTM, 2 Ave Univ, L-4365 Esch Sur Alzette, Luxembourg
Keywords
Slurm; Fairsharing; Workload analysis;
DOI
10.1109/ISPDC55340.2022.00027
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
High Performance Computing (HPC) is now a strategic asset required to sustain the surging demand for massive processing and data-analytic capabilities. In practice, the effective management of such large-scale, distributed computing infrastructures is left to a Resource and Job Management System (RJMS). This essential middleware component manages the computing resources and handles user requests to allocate them, while providing an optimized framework for starting, executing and monitoring jobs on the allocated resources. The University of Luxembourg has operated a large academic HPC facility for 15 years; since 2017 it has relied on the Slurm RJMS, introduced on top of the flagship cluster Iris. The acquisition of a new liquid-cooled supercomputer named Aion, released in 2021, was the occasion to thoroughly review and optimize the initial Slurm configuration, the resource limits defined and the underlying fairsharing algorithm. This paper presents the outcomes of this study and details the implemented RJMS policy. The impact of these decisions on the supercomputers' workloads is also described. In particular, the performance evaluation shows that, compared to the initial configuration, the implemented environment brought concrete and measurable improvements with regard to platform utilization (+12.64%), job efficiency (as measured by the average Wall-time Request Accuracy, improved by 110.81%), and management and funding (increased by 10%). The systems demonstrated sustainable and scalable HPC performance, and this effort incurred a negligible penalty on the average slowdown metric (response time normalized by runtime), which increased by 0.59% for job workloads covering a complete year of operation.
Overall, this new setup has been in production for 18 months on both supercomputers, and the updated model has proven to bring a fairer and more satisfying experience to end users. The proposed configurations and policies may help other HPC centres when designing or improving the RJMS that underpins their job scheduling strategy as computing capacity expands.
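As an illustration of the two job-level metrics named in the abstract, the following minimal Python sketch computes an average slowdown and an average Wall-time Request Accuracy over a toy workload. The slowdown follows the abstract's wording (response time divided by runtime). The abstract does not define Wall-time Request Accuracy, so the ratio of actual runtime to requested wall-time used here is an assumption, as are the `Job` record, its field names, and the sample values; none of this is taken from the paper's implementation.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Job:
    """Hypothetical job record; all times are in seconds."""
    submit: float              # time the job entered the queue
    start: float               # time the job began running
    end: float                 # time the job finished
    requested_walltime: float  # wall-time limit requested by the user


def slowdown(job: Job) -> float:
    """Response time (queue wait + runtime) normalized by runtime."""
    runtime = job.end - job.start
    response = job.end - job.submit
    return response / runtime


def walltime_request_accuracy(job: Job) -> float:
    """ASSUMED definition: actual runtime divided by requested wall-time."""
    return (job.end - job.start) / job.requested_walltime


# Toy workload with made-up values, for illustration only.
jobs = [
    Job(submit=0, start=100, end=200, requested_walltime=400),
    Job(submit=0, start=0, end=50, requested_walltime=100),
]

avg_slowdown = mean(slowdown(j) for j in jobs)
avg_wra = mean(walltime_request_accuracy(j) for j in jobs)
print(f"average slowdown: {avg_slowdown}")
print(f"average wall-time request accuracy: {avg_wra}")
```

Under these assumptions, a slowdown of 1.0 means a job never waited in the queue, and an accuracy close to 1.0 means the requested wall-time closely matched the actual runtime.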
Pages: 129-137
Number of pages: 9
    INTERNATIONAL SYMPOSIUM ON FUZZY SYSTEMS, KNOWLEDGE DISCOVERY AND NATURAL COMPUTATION (FSKDNC 2014), 2014, : 212 - 219