Optimizing the Resource and Job Management System of an Academic HPC & Research Computing Facility

Times cited: 0
Authors
Varrette, Sebastien [1]
Kieffer, Emmanuel [1]
Pinel, Frederic [1]
Affiliations
[1] Univ Luxembourg Luxembourg, Fac Sci Technol & Med FSTM, 2 Ave Univ, L-4365 Esch Sur Alzette, Luxembourg
Keywords
Slurm; Fairsharing; Workload analysis;
DOI
10.1109/ISPDC55340.2022.00027
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
High Performance Computing (HPC) is nowadays a strategic asset required to sustain the surging demand for massive processing and data-analytic capabilities. In practice, the effective management of such large-scale, distributed computing infrastructures is left to a Resource and Job Management System (RJMS). This essential middleware component is responsible for managing the computing resources and handling user requests to allocate them, while providing an optimized framework for starting, executing and monitoring jobs on the allocated resources. The University of Luxembourg has operated a large academic HPC facility for 15 years; since 2017 it has relied on the Slurm RJMS, introduced on the flagship cluster Iris. The acquisition of a new liquid-cooled supercomputer named Aion, released in 2021, was the occasion to thoroughly review and optimize the seminal Slurm configuration, the resource limits defined and the underlying fairsharing algorithm. This paper presents the outcomes of this study and details the implemented RJMS policy. The impact of these decisions on the supercomputers' workloads is also described. In particular, the performance evaluation shows that, compared to the seminal configuration, the implemented environment brought concrete and measurable improvements with regard to platform utilization (+12.64%), job efficiency (as measured by the average Wall-time Request Accuracy, improved by 110.81%) and management and funding (increased by 10%). The systems demonstrated sustainable and scalable HPC performance, and this effort incurred only a negligible penalty on the average slowdown metric (response time normalized by runtime), which increased by 0.59% for job workloads covering a complete year of operation.
Overall, this new setup has been in production for 18 months on both supercomputers, and the updated model has proven to provide a fairer and more satisfying experience to end users. The proposed configurations and policies may help other HPC centres design or improve the RJMS underpinning their job scheduling strategy when expanding computing capacity.
Pages: 129-137
Page count: 9
Related papers
50 in total
  • [31] Research of Cloud Computing in Management Information System
    Xiao, Zhengxing
    ENGINEERING SOLUTIONS FOR MANUFACTURING PROCESSES, PTS 1-3, 2013, 655-657 : 1826 - 1829
  • [32] Jointly Optimizing Job Assignment and Resource Partitioning for Improving System Throughput in Cloud Datacenters
    Chen, Ruobing
    Shi, Haosen
    Wu, Jinping
    Li, Yusen
    Liu, Xiaoguang
    Wang, Gang
    ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2023, 20 (03)
  • [33] External Resource, Job Characteristics, Competitive Strategy and the Formation of Human Resource Management System
    Zhang Ling
    Nie Ting
    Luo Yongtai
    Zhang Zhengtang
    HUMAN RESOURCES MANAGEMENT IN THE KNOWLEDGE ECONOMY ERA, VOLS I AND II, 2009, : 1230 - +
  • [34] Research and Design of Digital Learning Resource Management System in Meteorological Adult Training Based on Cloud Computing
    Hou, Jinfang
    31ST INTERNATIONAL CONFERENCE ON COMPUTERS IN EDUCATION, ICCE 2023, VOL II, 2023, : 249 - 253
  • [35] Design and research of cloud computing resource data monitoring system
    Wei Guanghui
    PROCEEDINGS OF THE 2016 7TH INTERNATIONAL CONFERENCE ON MECHATRONICS, CONTROL AND MATERIALS (ICMCM 2016), 2016, 104 : 40 - 43
  • [36] Resource management and job scheduling of China earthquake grid experiment system: Construction of resource management and job dynamic scheduling model ProRMJS
    Hou Jian-min
    Liu Rui-feng
    Shan Bao-hua
    Zhao Yong
    Niu Ai-jun
    Zou Li-ye
    Hou Li-hua
    Han Jun
    EARTHQUAKE SCIENCE, 2006, 19 (06) : 695 - 703
  • [37] A resource management system for network computing using Java
    Maheswaran, M
    Chen, H
    Pradhan, S
    Pantel, P
    Zheng, L
    Min, R
    Groner, T
    PROCEEDINGS OF THE FIFTH JOINT CONFERENCE ON INFORMATION SCIENCES, VOLS 1 AND 2, 2000, : 453 - 456
  • [38] Efficient dynamical system resource management method in cloud computing
    Wu, Tianshu
    Wang, Shuai
    Shi, Xiaoyu
    JOURNAL OF ENGINEERING-JOE, 2019, 2019 (23) : 8891 - 8894
  • [39] A Design of Resource Fault Handling Mechanism using Dynamic Resource Reallocation for the Resource and Job Management System
    Kim, Young-Ho
    Lim, Eun-Ji
    Cha, Gyu-Il
    Bae, Seung-Jo
    2015 17TH INTERNATIONAL CONFERENCE ON ADVANCED COMMUNICATION TECHNOLOGY (ICACT), 2015, : 701 - 705
  • [40] Research on the process resource management in CAPP system
    Hao Xiuqing
    Proceedings of the 3rd International Conference on Innovation & Management, Vols 1 and 2, 2006, : 986 - 990