An Efficient and Non-Intrusive GPU Scheduling Framework for Deep Learning Training Systems

Cited by: 43
Authors
Wang, Shaoqi [1]
Gonzalez, Oscar J. [2]
Zhou, Xiaobo [1]
Williams, Thomas [2]
Friedman, Brian D. [2]
Havemann, Martin [2]
Woo, Thomas [2]
Affiliations
[1] Univ Colorado, Dept Comp Sci, Colorado Springs, CO 80907 USA
[2] Nokia Bell Labs, New Providence, NJ USA
Keywords
deep learning; GPU clusters; resource scheduling; container; Kubernetes
DOI
10.1109/SC41405.2020.00094
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Efficient GPU scheduling is the key to minimizing the execution time of the Deep Learning (DL) training workloads. DL training system schedulers typically allocate a fixed number of GPUs to each job, which inhibits high resource utilization and often extends the overall training time. The recent introduction of schedulers that can dynamically reallocate GPUs has achieved better cluster efficiency. This dynamic nature, however, introduces additional overhead by terminating and restarting jobs or requires modification to the DL training frameworks. We propose and develop an efficient, non-intrusive GPU scheduling framework that employs a combination of an adaptive GPU scheduler and an elastic GPU allocation mechanism to reduce the completion time of DL training workloads and improve resource utilization. Specifically, the adaptive GPU scheduler includes a scheduling algorithm that uses training job progress information to determine the most efficient allocation and reallocation of GPUs for incoming and running jobs at any given time. The elastic GPU allocation mechanism works in concert with the scheduler. It offers a lightweight and non-intrusive method to reallocate GPUs based on a "SideCar" process that temporarily stops and restarts the job's DL training process with a different number of GPUs. We implemented the scheduling framework as plugins in Kubernetes and conducted evaluations on two 16-GPU clusters with multiple training jobs based on TensorFlow. Results show that our proposed scheduling framework reduces the overall execution time and the average job completion time by up to 45% and 63%, respectively, compared to the Kubernetes default scheduler. Compared to a termination-based scheduler, our framework reduces the overall execution time and the average job completion time by up to 20% and 37%, respectively.
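To make the idea of progress-driven GPU reallocation concrete, here is a minimal illustrative sketch. It is not the paper's scheduling algorithm, which the abstract does not specify. It assumes a hypothetical sublinear speedup model (`estimated_time`, with an invented `efficiency` exponent) and a simple greedy policy (`allocate_gpus`) that gives each job one GPU and then hands each remaining GPU to the job whose estimated remaining time would shrink the most. All names and parameters are assumptions made for illustration.

```python
import heapq


def estimated_time(remaining_work, gpus, efficiency=0.9):
    """Hypothetical model: n GPUs yield a sublinear speedup of n**efficiency."""
    return remaining_work / (gpus ** efficiency)


def allocate_gpus(remaining_work, total_gpus):
    """Toy greedy allocation (an assumption, not the paper's algorithm).

    remaining_work maps a job name to its remaining work, which a real
    scheduler would derive from training progress information.
    """
    if total_gpus < len(remaining_work):
        raise ValueError("this toy model needs at least one GPU per job")

    # Every job starts with one GPU.
    alloc = {job: 1 for job in remaining_work}
    free = total_gpus - len(alloc)

    # Max-heap (via negated gain) of the time saved by one extra GPU.
    heap = []
    for job, work in remaining_work.items():
        gain = estimated_time(work, 1) - estimated_time(work, 2)
        heapq.heappush(heap, (-gain, job))

    # Hand out the remaining GPUs one at a time to the best candidate.
    while free > 0 and heap:
        _, job = heapq.heappop(heap)
        alloc[job] += 1
        free -= 1
        n = alloc[job]
        work = remaining_work[job]
        gain = estimated_time(work, n) - estimated_time(work, n + 1)
        heapq.heappush(heap, (-gain, job))

    return alloc
```

In a setting like the one the abstract describes, a function of this kind would be re-run whenever a job arrives or finishes, and any change in a job's allocation would then be applied by stopping and restarting its training process with the new GPU count.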
Pages: 13