A fine-grained GPU sharing and job scheduling for deep learning jobs on the cloud

Cited by: 0
Authors
Chung, Wu-Chun [1]
Tong, Jyun-Sen [1]
Chen, Zhi-Hao [1]
Affiliations
[1] Chung Yuan Christian Univ, Dept Informat & Comp Engn, Taoyuan 320, Taiwan
Source
JOURNAL OF SUPERCOMPUTING | 2025, Vol. 81, Issue 02
Keywords
Deep learning; GPU sharing; Resource allocation; Job scheduling; Cloud computing
DOI
10.1007/s11227-024-06849-5
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
This paper introduces a GPU sharing and scheduling method to tackle resource wastage and underutilization in deep learning training jobs. Existing methods rely on execution time estimation models and rarely explore fine-grained GPU sharing. Our approach uses a suspend-and-resume mechanism to save and migrate model training states. With a lightweight sampling analysis to predict job completion times, the proposed method mitigates starvation of large jobs and reuses fragmented resources, which in turn reduces job completion and waiting times. Performance is evaluated using Microsoft Philly data and TF-Slim benchmarks on four image classification models. Compared to traditional methods, our approach increases resource utilization by 4.1 times and reduces completion time by a factor of 3.6. The proposed method improves deep learning training efficiency and makes better use of idle GPU resources, providing a flexible solution for future training needs.
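The abstract mentions a lightweight sampling analysis for predicting job completion times but does not describe it in this record. As a rough illustration of the general idea only, and not the authors' method, the sketch below times a few training steps and extrapolates to the remaining steps. The function names `sample_step_seconds` and `predict_completion_seconds`, the use of the median, and the warm-up handling are all assumptions made for this example.

```python
import time
from statistics import median
from typing import Callable, List


def sample_step_seconds(run_step: Callable[[], None],
                        sample_steps: int = 5,
                        warmup_steps: int = 1) -> List[float]:
    """Time a few training steps (hypothetical helper, not from the paper)."""
    # Warm-up steps are run but not timed, since first iterations are often slower.
    for _ in range(warmup_steps):
        run_step()
    durations = []
    for _ in range(sample_steps):
        start = time.perf_counter()
        run_step()
        durations.append(time.perf_counter() - start)
    return durations


def predict_completion_seconds(sampled_step_seconds: List[float],
                               total_steps: int,
                               completed_steps: int = 0) -> float:
    """Extrapolate remaining run time from sampled per-step durations."""
    if not sampled_step_seconds:
        raise ValueError("need at least one sampled step duration")
    if not 0 <= completed_steps <= total_steps:
        raise ValueError("completed_steps must be between 0 and total_steps")
    # The median is an assumed choice here; it damps the effect of outlier steps.
    per_step = median(sampled_step_seconds)
    return per_step * (total_steps - completed_steps)


if __name__ == "__main__":
    # Stand-in for one training iteration; a real job would run a model step.
    durations = sample_step_seconds(lambda: time.sleep(0.01), sample_steps=3)
    print(predict_completion_seconds(durations, total_steps=1000))
```

A scheduler could call such an estimator again after a job is suspended and resumed, passing `completed_steps` so that only the remaining work is counted.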
Pages: 30