A fine-grained GPU sharing and job scheduling for deep learning jobs on the cloud

Times cited: 0
Authors
Chung, Wu-Chun [1]
Tong, Jyun-Sen [1]
Chen, Zhi-Hao [1]
Affiliation
[1] Chung Yuan Christian Univ, Dept Informat & Comp Engn, Taoyuan 320, Taiwan
Source
JOURNAL OF SUPERCOMPUTING | 2025 / Vol. 81 / Issue 02
Keywords
Deep learning; GPU sharing; Resource allocation; Job scheduling; Cloud computing
DOI
10.1007/s11227-024-06849-5
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
This paper introduces a GPU sharing and scheduling method that addresses resource wastage and underutilization in deep learning training jobs. Existing methods rely on execution-time estimation models and rarely explore fine-grained GPU sharing. Our approach uses a suspend-and-resume mechanism to save and migrate model training states. Combined with a lightweight sampling analysis that predicts job completion times, the method prevents starvation of large jobs and reuses fragmented GPU resources, which reduces both job completion times and waiting times. Performance is evaluated using Microsoft Philly trace data and TF-Slim benchmarks on four image classification models. Compared with traditional methods, our approach increases resource utilization by 4.1 times and reduces completion time by 3.6 times. The proposed method improves deep learning training efficiency and makes better use of idle GPU resources, providing a flexible and efficient solution for future training workloads.
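The abstract names the ingredients (sampled completion-time prediction, suspend/resume of training state, starvation avoidance for large jobs, reuse of fragmented GPUs) but not the exact policy. As an illustration only, the Python sketch below combines them under our own assumptions: per-iteration times from a short sampling run are averaged and extrapolated to the remaining iterations; jobs are otherwise ordered shortest predicted remaining time first; any job that has waited longer than a hypothetical `starve_limit` is served first and gets the free GPUs held for it; and leftover GPUs are backfilled with jobs that fit. The `Job` fields, `predict_remaining`, `pick_jobs`, and `starve_limit` are hypothetical names, `done_iters` merely stands in for the progress a suspend/resume checkpoint would preserve, and this generic policy is not the authors' algorithm.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Job:
    # Hypothetical job record; the field names are illustrative only.
    name: str
    gpus: int             # GPUs requested by the job
    total_iters: int      # training iterations to complete
    submit_time: float    # arrival time in seconds
    done_iters: int = 0   # progress restored from the last suspend checkpoint


def predict_remaining(job: Job, sampled_iter_times: List[float]) -> float:
    """Extrapolate remaining run time from a few sampled iteration durations."""
    per_iter = mean(sampled_iter_times)
    return per_iter * (job.total_iters - job.done_iters)


def pick_jobs(queue: List[Job], predictions: Dict[str, float],
              free_gpus: int, now: float, starve_limit: float) -> List[Job]:
    """Select jobs to start on the currently free (possibly fragmented) GPUs."""
    def starving(job: Job) -> bool:
        return now - job.submit_time >= starve_limit

    # Starving jobs are served oldest first, regardless of their size.
    old = sorted((j for j in queue if starving(j)), key=lambda j: j.submit_time)
    # All other jobs follow shortest-predicted-remaining-time order.
    rest = sorted((j for j in queue if not starving(j)),
                  key=lambda j: predictions[j.name])
    chosen: List[Job] = []
    for job in old:
        if job.gpus > free_gpus:
            # Hold the free GPUs for this starving job instead of letting
            # smaller jobs keep jumping ahead of it.
            return chosen
        chosen.append(job)
        free_gpus -= job.gpus
    for job in rest:
        if job.gpus <= free_gpus:
            # Backfill leftover GPUs that would otherwise sit idle.
            chosen.append(job)
            free_gpus -= job.gpus
    return chosen


if __name__ == "__main__":
    queue = [Job("resnet", 4, 1000, 0.0),
             Job("vgg", 2, 500, 50.0),
             Job("mobilenet", 1, 200, 60.0)]
    samples = {"resnet": [0.25, 0.5, 0.75], "vgg": [0.25] * 3,
               "mobilenet": [0.125] * 3}
    preds = {j.name: predict_remaining(j, samples[j.name]) for j in queue}
    # At t=60 no job has waited 90 s, so short jobs fill the 3 free GPUs.
    print([j.name for j in pick_jobs(queue, preds, 3, 60.0, 90.0)])
    # prints ['mobilenet', 'vgg']
    # At t=100 "resnet" is starving and needs 4 GPUs, so nothing else starts.
    print([j.name for j in pick_jobs(queue, preds, 3, 100.0, 90.0)])
    # prints []
```

Holding GPUs for a starving job trades some short-term utilization for a bound on waiting time. A fuller scheduler would also decide when to suspend running jobs to free GPUs for it, which this sketch omits.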
Pages: 30
Related papers
50 in total
  • [1] Efficient Sharing and Fine-Grained Scheduling of Virtualized GPU Resources
    Zhao, Xiaohui
    Yao, Jianguo
    Gao, Ping
    Guan, Haibing
    2018 IEEE 38TH INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS (ICDCS), 2018: 742-752
  • [2] GPUPool: A Holistic Approach to Fine-Grained GPU Sharing in the Cloud
    Tan, Xiaodan Serina
    Golikov, Pavel
    Vijaykumar, Nandita
    Pekhimenko, Gennady
    PROCEEDINGS OF THE 2022 31ST INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES, PACT 2022, 2022: 317-332
  • [3] Fine-Grained Scheduling in Cloud Gaming on Heterogeneous CPU-GPU Clusters
    Zhang, Wei
    Liao, Xiaofei
    Li, Peng
    Jin, Hai
    Lin, Li
    Zhou, Bing Bing
    IEEE NETWORK, 2018, 32 (01): 172-178
  • [4] ShareRender: Bypassing GPU Virtualization to Enable Fine-grained Resource Sharing for Cloud Gaming
    Zhang, Wei
    Liao, Xiaofei
    Li, Peng
    Jin, Hai
    Lin, Li
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017: 324-332
  • [5] Advice Complexity of Fine-Grained Job Shop Scheduling
    Wehner, David
    ALGORITHMS AND COMPLEXITY (CIAC 2015), 2015, 9079: 416-428
  • [6] Hierarchical Bucket Queuing for Fine-Grained Priority Scheduling on the GPU
    Kerbl, Bernhard
    Kenzel, Michael
    Schmalstieg, Dieter
    Seidel, Hans-Peter
    Steinberger, Markus
    COMPUTER GRAPHICS FORUM, 2017, 36 (08): 232-246
  • [7] Dynamically Fine-grained Scheduling Method in Cloud Environment
    Zhou M.-S.
    Dong X.-S.
    Chen H.
    Zhang X.-J.
    Ruan Jian Xue Bao/Journal of Software, 2020, 31 (12): 3981-3999
  • [8] Nimblock: Scheduling for Fine-grained FPGA Sharing through Virtualization
    Mandava, Meghna
    Reckamp, Paul
    Chen, Deming
    PROCEEDINGS OF THE 2023 50TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, ISCA 2023, 2023: 843-855
  • [9] Fine-Grained Data Sharing in Cloud Computing for Mobile Devices
    Shao, Jun
    Lu, Rongxing
    Lin, Xiaodong
    2015 IEEE CONFERENCE ON COMPUTER COMMUNICATIONS (INFOCOM), 2015
  • [10] Fast and Fine-grained Autoscaler for Streaming Jobs with Reinforcement Learning
    Xing, Mingzhe
    Mao, Hangyu
    Xiao, Zhen
    PROCEEDINGS OF THE THIRTY-FIRST INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2022, 2022: 564-570