Liquid: Intelligent Resource Estimation and Network-Efficient Scheduling for Deep Learning Jobs on Distributed GPU Clusters

Cited: 51
Authors
Gu, Rong [1 ]
Chen, Yuquan [1 ]
Liu, Shuai [1 ]
Dai, Haipeng [1 ]
Chen, Guihai
Zhang, Kai [2 ]
Che, Yang [1 ,2 ]
Huang, Yihua [1 ]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing 210023, Jiangsu, Peoples R China
[2] Alibaba Grp, Hangzhou 311121, Zhejiang, Peoples R China
Funding
National Science Foundation (USA);
Keywords
Graphics processing units; Processor scheduling; Resource management; Estimation; Liquids; Optimization; Training; Job scheduling; resource management; deep learning; GPU clusters;
DOI
10.1109/TPDS.2021.3138825
CLC Number
TP301 [Theory and Methods];
Subject Classification Code
081202;
Abstract
Deep learning (DL) is becoming increasingly popular in many domains, including computer vision, speech recognition, self-driving automobiles, etc. GPUs can train DL models efficiently but are expensive, which motivates users to share GPU resources to reduce costs in practice. To ensure efficient sharing among multiple users, it is necessary to develop efficient GPU resource management and scheduling solutions. However, existing ones have several shortcomings. First, they require users to specify their jobs' resource requirements, which are usually quite inaccurate and lead to cluster resource underutilization. Second, when scheduling DL jobs, they rarely take the cluster network characteristics into consideration, resulting in low job execution performance. To overcome the above issues, we propose Liquid, an efficient GPU resource management platform for DL jobs with intelligent resource requirement estimation and scheduling. First, we propose a regression-model-based method for job resource requirement estimation, which keeps users from over-allocating computing resources. Second, building on this estimation technique, we propose intelligent, cluster-network-efficient scheduling methods in both immediate and batch modes. Third, we further propose three system-level optimizations: pre-scheduling data transmission, fine-grained GPU sharing, and event-driven communication. Experimental results show that Liquid accelerates job execution by 18% on average and shortens the average job completion time (JCT) by 21% compared with cutting-edge solutions. Moreover, the proposed optimization methods are effective in various scenarios.
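As a rough illustration of the resource-estimation idea described in the abstract, the sketch below fits a regressor on features of previously observed jobs and predicts a newly submitted job's peak GPU memory, so the scheduler need not rely on user-supplied figures. The feature set, model choice, and all numbers here are illustrative assumptions, not Liquid's actual design.

```python
# Minimal sketch of regression-based job resource estimation,
# assuming hypothetical job features and synthetic training data
# (NOT the feature set or model used by Liquid itself).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical job features: (model parameters in millions,
# batch size, input sample size in KB).
X_train = np.array([
    [ 25.0,  32, 150.0],
    [ 60.0,  64, 150.0],
    [138.0, 128, 600.0],
    [345.0, 256, 600.0],
])
# Observed peak GPU memory (GB) for those jobs -- synthetic numbers.
y_train = np.array([3.1, 6.8, 14.5, 29.0])

# One regressor per resource dimension; GPU memory only, for brevity.
model = GradientBoostingRegressor(n_estimators=100)
model.fit(X_train, y_train)

# Estimate the requirement of a new job before scheduling it, instead
# of trusting a user-declared (and typically inflated) request.
new_job = np.array([[90.0, 64, 300.0]])
print(f"estimated peak GPU memory: {model.predict(new_job)[0]:.1f} GB")
```

In the paper's setting, an estimator of this kind would feed the network-efficient scheduler, which then places a job's workers so as to reduce cross-node traffic.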
Pages: 2808-2820
Page count: 13
Related Papers
50 records in total
  • [1] DLTAP: A Network-efficient Scheduling Method for Distributed Deep Learning Workload in Containerized Cluster Environment
    Qiao, Wei
    Li, Ying
    Wu, Zhong-Hai
    4TH ANNUAL INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY AND APPLICATIONS (ITA 2017), 2017, 12
  • [2] Scheduling Deep Learning Jobs in Multi-Tenant GPU Clusters via Wise Resource Sharing
    Luo, Yizhou
    Wang, Qiang
    Shi, Shaohuai
    Lai, Jiaxin
    Qi, Shuhan
    Zhang, Jiajia
    Wang, Xuan
2024 IEEE/ACM 32ND INTERNATIONAL SYMPOSIUM ON QUALITY OF SERVICE, IWQOS, 2024
  • [3] Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters
    Zhang, Hao
    Zheng, Zeyu
    Xu, Shizhen
    Dai, Wei
Ho, Qirong
    Liang, Xiaodan
    Hu, Zhiting
Wei, Jinliang
    Xie, Pengtao
    Xing, Eric P.
    2017 USENIX ANNUAL TECHNICAL CONFERENCE (USENIX ATC '17), 2017, : 181 - 193
  • [4] Scheduling CPU for GPU-based Deep Learning Jobs
    Xiao, Wencong
    Han, Zhenhua
    Zhao, Hanyu
    Peng, Xuan
    Zhang, Quanlu
    Yang, Fan
    Zhou, Lidong
    PROCEEDINGS OF THE 2018 ACM SYMPOSIUM ON CLOUD COMPUTING (SOCC '18), 2018, : 503 - 503
  • [5] PickyMan: A Preemptive Scheduler for Deep Learning Jobs on GPU Clusters
    Chen, Chen
    Chen, Yingwen
    Chen, Zhaoyun
    Han, Jianchen
    Xue, Guangtao
2022 IEEE INTERNATIONAL PERFORMANCE, COMPUTING, AND COMMUNICATIONS CONFERENCE, IPCCC, 2022
  • [6] Cooperative Distributed GPU Power Capping for Deep Learning Clusters
    Kang, Dong-Ki
    Ha, Yun-Gi
    Peng, Limei
    Youn, Chan-Hyun
    IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, 2022, 69 (07) : 7244 - 7254
  • [7] Poster Abstract: Deep Learning Workloads Scheduling with Reinforcement Learning on GPU Clusters
    Chen, Zhaoyun
    Luo, Lei
    Quan, Wei
    Wen, Mei
    Zhang, Chunyuan
    IEEE CONFERENCE ON COMPUTER COMMUNICATIONS WORKSHOPS (IEEE INFOCOM 2019 WKSHPS), 2019, : 1023 - 1024
  • [8] GPARS: Graph predictive algorithm for efficient resource scheduling in heterogeneous GPU clusters
    Wang, Sheng
    Chen, Shiping
    Shi, Yumei
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2024, 152 : 127 - 137
  • [9] Voda: A GPU Scheduling Platform for Elastic Deep Learning in Kubernetes Clusters
    Hsieh, Tsung-Tso
    Lee, Che-Rung
    2023 IEEE INTERNATIONAL CONFERENCE ON CLOUD ENGINEERING, IC2E, 2023, : 131 - 140
  • [10] Scheduling Distributed Deep Learning Jobs in Heterogeneous Cluster with Placement Awareness
    Li, Qingping
    Xu, Jingwei
    Cao, Chun
    THE 12TH ASIA-PACIFIC SYMPOSIUM ON INTERNETWARE, INTERNETWARE 2020, 2021, : 217 - 228