Optimizing the Resource and Job Management System of an Academic HPC & Research Computing Facility

Authors
Varrette, Sebastien [1]
Kieffer, Emmanuel [1]
Pinel, Frederic [1]
Affiliations
[1] Univ Luxembourg Luxembourg, Fac Sci Technol & Med FSTM, 2 Ave Univ, L-4365 Esch Sur Alzette, Luxembourg
Keywords
Slurm; Fairsharing; Workload analysis;
DOI
10.1109/ISPDC55340.2022.00027
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
High Performance Computing (HPC) is nowadays a strategic asset required to sustain the surging demand for massive processing and data-analytic capabilities. In practice, the effective management of such large-scale, distributed computing infrastructures is left to a Resource and Job Management System (RJMS). This essential middleware component manages the computing resources and handles user requests to allocate them, while providing an optimized framework for starting, executing and monitoring jobs on the allocated resources. The University of Luxembourg has operated a large academic HPC facility for 15 years; since 2017 it has relied on the Slurm RJMS, introduced on the flagship cluster Iris. The acquisition of a new liquid-cooled supercomputer named Aion, released in 2021, was the occasion to review in depth and optimize the original Slurm configuration, the resource limits defined and the underlying fairsharing algorithm. This paper presents the outcomes of this study and details the implemented RJMS policy. The impact of the decisions made on the supercomputers' workloads is also described. In particular, the performance evaluation shows that, compared to the original configuration, the implemented environment brought concrete and measurable improvements with regard to platform utilization (+12.64%), job efficiency (as measured by the average Wall-time Request Accuracy, improved by 110.81%) and management and funding (increased by 10%). The systems demonstrated sustainable and scalable HPC performance, and this effort led to only a negligible penalty on the average slowdown metric (response time normalized by runtime), which increased by 0.59% for job workloads covering a complete year of operation.
Overall, the new setup has been in production for 18 months on both supercomputers, and the updated model has proven to give end users a fairer and more satisfying experience. The proposed configurations and policies may help other HPC centres design or improve the RJMS that sustains their job scheduling strategy when expanding computing capacity.
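The two job-level metrics named in the abstract can be illustrated with a short Python sketch. The slowdown follows the abstract's own wording (response time normalized by runtime). The formula used here for Wall-time Request Accuracy (actual runtime divided by requested wall-time) is an assumption, since the abstract does not define it, and the `Job` record and the sample values are purely hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Job:
    """Hypothetical job record; all times are in seconds."""
    submit: float     # submission time
    start: float      # time the job started running
    end: float        # time the job finished
    requested: float  # wall-time limit requested by the user

    @property
    def runtime(self) -> float:
        return self.end - self.start

    @property
    def response_time(self) -> float:
        # Queue wait plus runtime.
        return self.end - self.submit


def slowdown(job: Job) -> float:
    # Response time normalized by runtime, as described in the abstract.
    # Assumes a strictly positive runtime.
    return job.response_time / job.runtime


def walltime_request_accuracy(job: Job) -> float:
    # ASSUMPTION: runtime / requested wall-time; not defined in the abstract.
    return job.runtime / job.requested


def relative_change(before: float, after: float) -> float:
    # Percentage change between two configurations.
    return 100.0 * (after - before) / before


# Made-up sample workload, not data from the paper.
jobs = [
    Job(submit=0, start=100, end=300, requested=400),
    Job(submit=50, start=50, end=150, requested=1000),
]
avg_slowdown = mean(slowdown(j) for j in jobs)
avg_wra = mean(walltime_request_accuracy(j) for j in jobs)
```

Under this reading, a slowdown of 1.0 means a job never waited in the queue, and a Wall-time Request Accuracy close to 1.0 means the requested wall-time closely matched the actual runtime. `relative_change` is the kind of before/after comparison behind figures such as the reported 0.59% slowdown increase.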
Pages: 129-137
Number of pages: 9