Testing SLURM open source batch system for a Tier1/Tier2 HEP computing facility

Cited by: 1
Authors
Donvito, Giacinto [1]
Salomoni, Davide
Italiano, Alessandro [1]
Affiliations
[1] INFN Bari, Via Orabona 4, I-70126 Bari, Italy
DOI
10.1088/1742-6596/513/3/032027
Chinese Library Classification
O57 [Nuclear physics; high-energy physics]
Discipline code
070202
Abstract
In this work we describe the testing activities carried out to verify whether the SLURM batch system could be used as the production batch system of a typical Tier1/Tier2 HEP computing center. SLURM (Simple Linux Utility for Resource Management) is an open-source batch system developed mainly by the Lawrence Livermore National Laboratory, SchedMD, Linux NetworX, Hewlett-Packard, and Groupe Bull. Testing focused both on verifying the functionality of the batch system and on the performance SLURM is able to offer. We first describe our initial set of requirements. Functionally, we started by configuring SLURM so that it replicates all the scheduling policies already used in production in the computing centers involved in the test, i.e. INFN-Bari and the INFN-Tier1 at CNAF, Bologna. Currently, the INFN-Tier1 uses IBM LSF (Load Sharing Facility), while INFN-Bari, an LHC Tier2 for both CMS and ALICE, uses Torque as resource manager and MAUI as scheduler. We show how we configured SLURM to enable several scheduling functionalities, such as hierarchical fair-share, Quality of Service, user-based and group-based priority, limits on the number of jobs per user/group/queue, job age scheduling, job size scheduling, and scheduling of consumable resources. We then show how different job types, such as serial, MPI, multi-threaded, whole-node and interactive jobs, can be managed. We also describe tests on the use of ACLs on queues and, more generally, on other resources. A peculiar SLURM feature we verified is event triggers, useful to configure specific actions in response to each possible event in the batch system. We also tested highly available configurations for the master node. This feature is of paramount importance, since a mandatory requirement in our scenarios is to have a working farm cluster even in case of hardware failure of the server(s) hosting the batch system.
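The scheduling features listed above map onto SLURM's multifactor priority plugin, accounting-enforced limits, and backup-controller support. A minimal, hypothetical slurm.conf fragment (parameter names are real SLURM options of that era, but the values and host names are assumptions, not the configuration used in the paper) might look like:

```ini
# Hypothetical slurm.conf sketch; values are illustrative assumptions.

# Multifactor priority: fair-share, job age, job size and QOS all contribute.
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityWeightFairshare=100000
PriorityWeightQOS=10000
PriorityWeightAge=1000
PriorityWeightJobSize=1000

# Schedule consumable resources (CPUs, memory) rather than whole nodes.
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory

# Enforce per-user/per-group limits and QOS from the accounting database.
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageEnforce=limits,qos

# Highly available master: the backup controller takes over on failure.
ControlMachine=slurm-master1
BackupController=slurm-master2
StateSaveLocation=/shared/slurm/state   # must live on shared storage
```

Hierarchical fair-share is then obtained by building a tree of accounts and users with `sacctmgr` (e.g. one account per experiment, with per-user shares underneath), which SLURM's fair-share factor traverses hierarchically.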
Among our requirements is also the possibility to deal with pre-execution and post-execution scripts, and controlled handling of the failure of such scripts. This feature is heavily used, for example, at the INFN-Tier1 to check the health status of a worker node before the execution of each job. Pre- and post-execution scripts are also important to let WNoDeS, the IaaS Cloud solution developed at INFN, use SLURM as its resource manager. WNoDeS has supported the LSF and Torque batch systems for some time; here we show the work done so that WNoDeS supports SLURM as well. Finally, we present several performance tests carried out to verify SLURM scalability and reliability, detailing scalability tests both in terms of managed nodes and of queued jobs.
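Pre-execution checks of the kind described above are implemented in SLURM through its Prolog (and Epilog) hooks. A minimal sketch of such a health check, assuming a hypothetical free-disk-space test (the actual INFN-Tier1 checks are not detailed in the abstract):

```shell
#!/bin/bash
# Hypothetical worker-node health check for use as a SLURM Prolog script.
# The threshold and the check itself are illustrative assumptions.

node_healthy() {
  # Require at least 1 MB free in /tmp; a production check would be stricter
  # and would typically also probe shared filesystems, services, etc.
  local free_kb
  free_kb=$(df -Pk /tmp | awk 'NR==2 {print $4}')
  [ "$free_kb" -ge 1024 ]
}

if node_healthy; then
  echo "Prolog: node OK"
else
  echo "Prolog: node unhealthy, refusing job start" >&2
  exit 1   # a non-zero Prolog exit makes slurmd drain the node
fi
```

Pointing `Prolog=` in slurm.conf at such a script runs it before each job; a non-zero exit status drains the node instead of starting the job, which provides the controlled failure handling the requirement calls for.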
Pages: 6