A fine-grained GPU sharing and job scheduling for deep learning jobs on the cloud

Cited by: 0

Authors:

Chung, Wu-Chun [1]

Tong, Jyun-Sen [1]

Chen, Zhi-Hao [1]

Affiliations:

[1] Chung Yuan Christian Univ, Dept Informat & Comp Engn, Taoyuan 320, Taiwan

Source:

JOURNAL OF SUPERCOMPUTING | 2025 / Vol. 81 / Issue 02

Keywords:

Deep learning; GPU sharing; Resource allocation; Job scheduling; Cloud computing;

DOI:

10.1007/s11227-024-06849-5

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

This paper introduces a GPU sharing and scheduling method to tackle resource wastage and underutilization in deep learning training jobs. Existing methods rely on execution time estimation models and rarely explore fine-grained GPU sharing. Our approach leverages a suspend-and-resume mechanism to save and migrate model training states. Using a lightweight sampling analysis to predict job completion times, the proposed method addresses large-job starvation and reuses fragmented resources. By efficiently utilizing fragmented resources, the scheduler reduces job completion and waiting times. Performance is evaluated using Microsoft Philly data and TF-Slim benchmarks on four image classification models. Compared to traditional methods, our approach increases resource utilization by 4.1 times and reduces completion time by 3.6 times. The proposed method enhances deep learning training efficiency and makes better use of idle GPU resources, providing a flexible and efficient solution for future training needs.
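
The abstract mentions predicting job completion times from a lightweight sampling analysis but does not give the procedure. As a rough illustration only, and not the authors' method, the idea can be sketched as timing a few training steps and extrapolating linearly to the remaining steps. The function name `estimate_remaining_seconds` and its parameters are hypothetical, and the linear extrapolation is an assumption.

```python
from statistics import mean


def estimate_remaining_seconds(sample_step_seconds, total_steps, completed_steps=0):
    """Extrapolate remaining training time from a few sampled step durations.

    Hypothetical helper for illustration: assumes every remaining step takes
    about as long as the average sampled step (simple linear extrapolation).
    """
    if not sample_step_seconds:
        raise ValueError("need at least one sampled step duration")
    # Never report a negative number of remaining steps.
    remaining_steps = max(total_steps - completed_steps, 0)
    return mean(sample_step_seconds) * remaining_steps


# Three sampled steps of 0.5 s each; 800 of 1000 steps are still to run.
print(estimate_remaining_seconds([0.5, 0.5, 0.5], total_steps=1000, completed_steps=200))
```

A scheduler could use such an estimate to order waiting jobs, although the paper's actual prediction and scheduling logic may differ substantially from this sketch.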


Pages: 30

