Extending SLURM for Dynamic Resource-Aware Adaptive Batch Scheduling

Cited by: 11
Authors
Chadha, Mohak [1]
John, Jophin [1]
Gerndt, Michael [1]
Affiliations
[1] Tech Univ Munchen Garching Near Munich, Comp Architecture & Parallel Syst, Munich, Germany
Keywords
Dynamic resource-management; malleability; SLURM; performance-aware; power-aware scheduling
DOI
10.1109/HiPC50609.2020.00036
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
With growing constraints on power budgets and increasing hardware failure rates, the operation of future exascale systems faces several challenges. To address these challenges, resource awareness and adaptivity through malleable jobs have been actively researched in the HPC community. Malleable jobs can change their computing resources at runtime and can significantly improve HPC system performance. However, due to the rigid nature of popular parallel programming paradigms such as MPI and the lack of support for dynamic resource management in batch systems, malleable jobs have remained largely unrealized. In this paper, we extend the SLURM batch system to support the execution and batch scheduling of malleable jobs. The malleable applications are written using a new adaptive parallel paradigm called Invasive MPI, which extends the MPI standard to support resource adaptivity at runtime. We propose two malleable job scheduling strategies to support performance-aware and power-aware dynamic reconfiguration decisions at runtime. We implement the strategies in SLURM and evaluate them on a production HPC system. Results for our performance-aware scheduling strategy show improvements in makespan, average system utilization, and average response and waiting times compared to other scheduling strategies. Moreover, we demonstrate dynamic power corridor management using our power-aware strategy.
Pages: 223-232
Page count: 10