Probabilistic Reservation Services for Large-Scale Batch-Scheduled Systems

Cited by: 3
Authors
Nurmi, Daniel [1]
Wolski, Rich [1]
Brevik, John [2]
Affiliations
[1] Univ Calif Santa Barbara, Dept Comp Sci, Santa Barbara, CA 93106 USA
[2] Calif State Univ Long Beach, Dept Math & Stat, Long Beach, CA 90840 USA
Source
IEEE SYSTEMS JOURNAL | 2009 / Vol. 3 / Issue 01
DOI
10.1109/JSYST.2008.2011303
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Subject classification code
0812;
Abstract
In high-performance computing (HPC) settings, in which multiprocessor machines are shared among users with potentially competing resource demands, processors are allocated to user workloads using space sharing. Typically, users interact with a given machine by submitting their jobs to a centralized batch scheduler that implements a site-specific, and often partially hidden, policy designed to maximize machine utilization while providing tolerable turnaround times. In practice, while most HPC systems experience good utilization levels, the amount of time experienced by individual jobs waiting to begin execution has been shown to be highly variable and difficult to predict, leading to user confusion and/or frustration. One proposed method for dealing with this uncertainty is to predict the amount of time that individual jobs will wait in batch queues once they are submitted, thus allowing a user to reason about the total time between job submission and job completion (which we term a job's "overall turnaround time"). Another related but distinct method for handling the uncertainty is to allow users who are willing to plan ahead to make "advanced reservations" for processor resources, again allowing them to reason about job turnaround time. To date, however, few if any HPC centers provide either job-queue delay prediction services or advanced reservation capabilities to their general user populations. In this paper, we describe QBETS, VARQ, and CO-VARQ, new methods for allowing users to reason about and control the overall turnaround time of their batch-queue jobs submitted to busy HPC systems in existence today. QBETS is an online, non-parametric system for predicting statistical bounds on the amount of time individual batch jobs will wait in queue.
VARQ is a new method for job scheduling that provides users with probabilistic "virtual" advanced reservations using only existing best-effort batch schedulers and policies, and CO-VARQ utilizes this capability to implement a general coallocation service. QBETS, VARQ, and CO-VARQ operate as overlays, requiring no modification to the local scheduler implementation or policies. We describe the statistical methods we use to implement the systems, detail empirical evaluations of their effectiveness in a number of HPC settings, and explore the potential future impact of these systems should they become widely used.
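The abstract says only that QBETS predicts non-parametric statistical bounds on queue wait time; it does not give the procedure. As a rough illustration of what such a bound can look like, the following Python sketch computes a distribution-free upper confidence bound on a wait-time quantile from binomial order statistics. This is a generic textbook technique offered under the assumption that past wait times behave like independent samples. It is not taken from the paper, and the names `binomial_cdf` and `quantile_upper_bound` are hypothetical.

```python
from math import comb


def binomial_cdf(k, n, p):
    """Return P(Binomial(n, p) <= k) by direct summation (fine for small n)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def quantile_upper_bound(waits, quantile=0.95, confidence=0.95):
    """Distribution-free upper bound on the given quantile of wait time.

    Returns the smallest order statistic X_(k) such that, with probability
    at least `confidence`, X_(k) is at or above the true `quantile` of the
    wait-time distribution. Returns None when the history is too short for
    any order statistic to give that guarantee.

    Assumption (not from the source): waits are i.i.d. samples.
    """
    n = len(waits)
    ordered = sorted(waits)
    for k in range(1, n + 1):
        # X_(k) >= true quantile with probability >= P(Bin(n, quantile) <= k - 1)
        if binomial_cdf(k - 1, n, quantile) >= confidence:
            return ordered[k - 1]
    return None


# Hypothetical history: 59 observed queue waits, in seconds.
history = list(range(1, 60))
bound = quantile_upper_bound(history, quantile=0.95, confidence=0.95)
print(bound)
```

With the default settings, at least 59 observations are needed before any bound exists, because 1 - 0.95^n first reaches 0.95 at n = 59. A system that runs online would also have to decide how much history to keep when queue behavior changes, which this sketch does not attempt.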
Pages: 6 - 24
Number of pages: 19