Probabilistic Reservation Services for Large-Scale Batch-Scheduled Systems

Cited by: 3

Authors:

Nurmi, Daniel [1]

Wolski, Rich [1]

Brevik, John [2]

Affiliations:

[1] Univ Calif Santa Barbara, Dept Comp Sci, Santa Barbara, CA 93106 USA

[2] Calif State Univ Long Beach, Dept Math & Stat, Long Beach, CA 90840 USA

Source:

IEEE SYSTEMS JOURNAL | 2009 / Vol. 3 / No. 01

Keywords:

DOI:

10.1109/JSYST.2008.2011303

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

In high-performance computing (HPC) settings, in which multiprocessor machines are shared among users with potentially competing resource demands, processors are allocated to user workload using space sharing. Typically, users interact with a given machine by submitting their jobs to a centralized batch scheduler that implements a site-specific, and often partially hidden, policy designed to maximize machine utilization while providing tolerable turnaround times. In practice, while most HPC systems experience good utilization levels, the amount of time experienced by individual jobs waiting to begin execution has been shown to be highly variable and difficult to predict, leading to user confusion and/or frustration. One proposed method for dealing with this uncertainty is to predict the amount of time that individual jobs will wait in batch queues once they are submitted, thus allowing a user to reason about the total time between job submission and job completion (which we term a job's "overall turnaround time"). Another related but distinct method for handling the uncertainty is to allow users who are willing to plan ahead to make "advanced reservations" for processor resources, again allowing them to reason about job turnaround time. To date, however, few if any HPC centers provide either job-queue delay prediction services or advanced reservation capabilities to their general user populations. In this paper, we describe QBETS, VARQ, and CO-VARQ, new methods for allowing users to reason about and control the overall turnaround time of their batch-queue jobs submitted to busy HPC systems in existence today. QBETS is an online, non-parametric system for predicting statistical bounds on the amount of time individual batch jobs will wait in queue. VARQ is a new method for job scheduling that provides users with probabilistic "virtual" advanced reservations using only existing best-effort batch schedulers and policies, and CO-VARQ utilizes this capability to implement a general coallocation service. QBETS, VARQ, and CO-VARQ operate as overlays, requiring no modification to the local scheduler implementation or policies. We describe the statistical methods we use to implement the systems, detail empirical evaluations of their effectiveness in a number of HPC settings, and explore the potential future impact of these systems should they become widely used.
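
The abstract does not spell out how the statistical bounds are computed, so the following is only an illustrative sketch of one standard non-parametric technique: an upper confidence bound on a quantile of observed queue waits, derived from order statistics and the binomial distribution. The function names, the sample data, and the default 0.95 levels are assumptions made for illustration and are not taken from the paper.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def quantile_upper_bound(waits, quantile=0.95, confidence=0.95):
    """Hypothetical helper: return an observed wait time that is, with at
    least `confidence` probability, no smaller than the true `quantile`
    of the wait-time distribution. Assumes the samples are independent
    draws from one continuous distribution (an assumption of this sketch).
    Returns None when the history is too short to support the bound.
    """
    n = len(waits)
    ordered = sorted(waits)
    # The k-th smallest sample bounds the quantile from above exactly when
    # at most k-1 samples fall below the quantile, which has probability
    # binom_cdf(k - 1, n, quantile). Pick the smallest such k.
    for k in range(1, n + 1):
        if binom_cdf(k - 1, n, quantile) >= confidence:
            return ordered[k - 1]
    return None


# Made-up wait times in seconds: 59 samples is the shortest history for
# which a 95%-confidence bound on the 0.95 quantile exists.
history = list(range(1, 60))
bound = quantile_upper_bound(history)
```

With fewer than 59 samples the function returns None at these levels, which illustrates why a predictor of this kind needs a minimum amount of history before it can quote a bound.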

Citation

Pages: 6 - 24

Page count: 19

50 items in total

[1] Probabilistic Advanced Reservations for Batch-scheduled Parallel Machines
Nurmi, Daniel
Wolski, Rich
Brevik, John
PPOPP'08: PROCEEDINGS OF THE 2008 ACM SIGPLAN SYMPOSIUM ON PRINCIPLES AND PRACTICE OF PARALLEL PROGRAMMING, 2008, : 289 - 290
[2] Probabilistic reliable dissemination in large-scale systems
Kermarrec, AM
Massoulié, L
Ganesh, AJ
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2003, 14 (03) : 248 - 258
[3] Large-scale queuing systems and services pricing
Sevastianov, L. A.
Vasilyev, S. A.
2017 9TH INTERNATIONAL CONGRESS ON ULTRA MODERN TELECOMMUNICATIONS AND CONTROL SYSTEMS AND WORKSHOPS (ICUMT), 2017, : 7 - 12
[4] On the Throughput Optimization in Large-scale Batch-processing Systems
Kar, Sounak
Rehrmann, Robin
Mukhopadhyay, Arpan
Alt, Bastian
Ciucu, Florin
Koeppl, Heinz
Binnig, Carsten
Rizk, Amr
PERFORMANCE EVALUATION, 2020, 144
[5] Efficient and Lightweight Batch Authentication for Large-Scale RFID Systems
Li, Binbin
Liu, Wenyuan
Wang, Lin
IEEE WIRELESS COMMUNICATIONS LETTERS, 2019, 8 (04) : 1272 - 1275
[6] On the Throughput Optimization in Large-Scale Batch-Processing Systems
Kar S.
Rehrmann R.
Mukhopadhyay A.
Alt B.
Ciucu F.
Koeppl H.
Binnig C.
Rizk A.
Performance Evaluation Review, 2021, 48 (03) : 128 - 129
[7] TOWARDS ENERGY AWARE RESERVATION INFRASTRUCTURE FOR LARGE-SCALE EXPERIMENTAL DISTRIBUTED SYSTEMS
Lefevre, Laurent
Orgerie, Anne-Cecile
PARALLEL PROCESSING LETTERS, 2009, 19 (03) : 419 - 433
[8] ERIDIS: ENERGY-EFFICIENT RESERVATION INFRASTRUCTURE FOR LARGE-SCALE DISTRIBUTED SYSTEMS
Orgerie, Anne-Cecile
Lefevre, Laurent
PARALLEL PROCESSING LETTERS, 2011, 21 (02) : 133 - 154
[9] Probabilistic queries in large-scale networks
Pedone, F
Duarte, NL
Goulart, M
DEPENDABLE COMPUTING: EDCC-4, PROCEEDINGS, 2002, 2485 : 209 - 226
[10] Towards Large-Scale Probabilistic OBDA
Schoenfisch, Joerg
Stuckenschmidt, Heiner
SCALABLE UNCERTAINTY MANAGEMENT (SUM 2015), 2015, 9310 : 106 - 120
