An Efficient and Non-Intrusive GPU Scheduling Framework for Deep Learning Training Systems

Cited by: 43
Authors
Wang, Shaoqi [1]
Gonzalez, Oscar J. [2]
Zhou, Xiaobo [1]
Williams, Thomas [2]
Friedman, Brian D. [2]
Havemann, Martin [2]
Woo, Thomas [2]
Affiliations
[1] Univ Colorado, Dept Comp Sci, Colorado Springs, CO 80907 USA
[2] Nokia Bell Labs, New Providence, NJ USA
Keywords
deep learning; GPU clusters; resource scheduling; container; Kubernetes
DOI
10.1109/SC41405.2020.00094
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Efficient GPU scheduling is the key to minimizing the execution time of the Deep Learning (DL) training workloads. DL training system schedulers typically allocate a fixed number of GPUs to each job, which inhibits high resource utilization and often extends the overall training time. The recent introduction of schedulers that can dynamically reallocate GPUs has achieved better cluster efficiency. This dynamic nature, however, introduces additional overhead by terminating and restarting jobs or requires modification to the DL training frameworks. We propose and develop an efficient, non-intrusive GPU scheduling framework that employs a combination of an adaptive GPU scheduler and an elastic GPU allocation mechanism to reduce the completion time of DL training workloads and improve resource utilization. Specifically, the adaptive GPU scheduler includes a scheduling algorithm that uses training job progress information to determine the most efficient allocation and reallocation of GPUs for incoming and running jobs at any given time. The elastic GPU allocation mechanism works in concert with the scheduler. It offers a lightweight and non-intrusive method to reallocate GPUs based on a "SideCar" process that temporarily stops and restarts the job's DL training process with a different number of GPUs. We implemented the scheduling framework as plugins in Kubernetes and conducted evaluations on two 16-GPU clusters with multiple training jobs based on TensorFlow. Results show that our proposed scheduling framework reduces the overall execution time and the average job completion time by up to 45% and 63%, respectively, compared to the Kubernetes default scheduler. Compared to a termination-based scheduler, our framework reduces the overall execution time and the average job completion time by up to 20% and 37%, respectively.
Pages: 13
Related papers
50 items in total
  • [21] A Servlet Grouping Framework with Non-intrusive Implementation
    Xu, Min
    Li, Ning
    COMPONENTS, PACKAGING AND MANUFACTURING TECHNOLOGY, 2011, 460-461 : 587 - 592
  • [22] FederatedNILM: A Distributed and Privacy-Preserving Framework for Non-Intrusive Load Monitoring Based on Federated Deep Learning
    Dai, Shuang
    Meng, Fanlin
    Wang, Qian
    Chen, Xizhong
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [23] Non-intrusive model reduction of large-scale, nonlinear dynamical systems using deep learning
    Gao, Han
    Wang, Jian-Xun
    Zahr, Matthew J.
    PHYSICA D-NONLINEAR PHENOMENA, 2020, 412
  • [24] Self-Adaptive Non-Intrusive Load Monitoring Using Deep Learning
    Arampola, S. M. L.
    Nisakya, M. S. K.
    Yasodya, W. A.
    Kumarawadu, S.
    Logeeshan, V
    Wanigasekara, C.
    2024 IEEE 5TH ANNUAL WORLD AI IOT CONGRESS, AIIOT 2024, 2024, : 0540 - 0545
  • [25] Research on non-intrusive unknown load identification technology based on deep learning
    Yin, Bo
    Zhao, Liwen
    Huang, Xianqing
    Zhang, Ying
    Du, Zehua
    International Journal of Electrical Power and Energy Systems, 2021, 131
  • [26] Evaluation of Deep Learning-Based Non-Intrusive Thermal Load Monitoring
    Okazawa, Kazuki
    Kaneko, Naoya
    Zhao, Dafang
    Nishikawa, Hiroki
    Taniguchi, Ittetsu
    Catthoor, Francky
    Onoye, Takao
    ENERGIES, 2024, 17 (09)
  • [27] Non-intrusive Load Disaggregation Method Based on Edge Embedded Deep Learning
    Liu Y.
    Sun Y.
    Li B.
    Huang T.
    Dianwang Jishu/Power System Technology, 2019, 43 (12) : 4329 - 4336
  • [28] Tracking Defective Panel on Photovoltaic Strings with Non-Intrusive Monitoring and Deep Learning
    Rocha, Helder R. O.
    Silva, Andre
    Coura, Daniel J. C.
    Silvestre, Leonardo J.
    Junior, Luis O. Rigo
    Silva, Jair A. L.
    Celeste, Wanderley C.
    JOURNAL OF CONTROL AUTOMATION AND ELECTRICAL SYSTEMS, 2024, 35 (04) : 688 - 701
  • [29] Research on non-intrusive unknown load identification technology based on deep learning
    Yin, Bo
    Zhao, Liwen
    Huang, Xianqing
    Zhang, Ying
    Du, Zehua
    INTERNATIONAL JOURNAL OF ELECTRICAL POWER & ENERGY SYSTEMS, 2021, 131
  • [30] PrecisionProbe: Non-intrusive Performance Analysis Tool for Deep Learning Recommendation Models
    Peng, Weiyu
    Wang, Jinghao
    Wo, Tianyu
    Yang, Renyu
    2024 IEEE INTERNATIONAL CONFERENCE ON JOINT CLOUD COMPUTING, JCC, 2024, : 17 - 20